# general
a
Hi folks! I’m not sure this is the right channel for my question, but I’ve been fighting with some kind of bug in osquery (or a not-so-obvious misconfiguration) for a long time and I need your help. I know that @zwass has a really deep understanding of osquery’s under-the-hood mechanisms. On some Linux hosts in our server infrastructure I see problems with fetching host details and responding to ad-hoc queries. At the same time, distributed queries (already scheduled and new ones) return results without any problems. After installing the osquery agent from scratch (which includes removing the old RocksDB), the host responds to all queries for several hours (maybe a day), but at some unpredictable moment it just stops responding to ad-hoc queries. After that you can’t gather any information from this host through the FleetDM UI (we use version 4.31.0): neither host details nor ad-hoc queries. But such a host stays online in the FleetDM panel and keeps responding to distributed queries. Only a full reinstall of osquery, including removing /var/osquery/osquery.db, helps, and only temporarily. I see this problem only on a few specific hosts, and right now I can’t determine the root cause. These hosts have the same OS as the others, and osquery has the same version and configuration (managed by Ansible). You can see the osquery.flags file here:
--enroll_secret_path=<secret path>
--tls_hostname=<endpoint>:443
--host_identifier=hostname
--enroll_tls_endpoint=/api/osquery/enroll
--config_plugin=tls
--config_tls_endpoint=/api/osquery/config
--config_refresh=60
--disable_distributed=false
--distributed_plugin=tls
--distributed_interval=30
--distributed_tls_max_attempts=5
--distributed_tls_read_endpoint=/api/osquery/distributed/read
--distributed_tls_write_endpoint=/api/osquery/distributed/write
--logger_plugin=filesystem,tls
--logger_tls_endpoint=/api/osquery/log
--logger_tls_period=10
--logger_tls_max_linesize=20971520
--read_max=209715200
--table_delay=200
--disable_carver=false
--carver_disable_function=false
--carver_compression=true
--carver_start_endpoint=/api/osquery/carve/begin
--carver_continue_endpoint=/api/osquery/carve/block
--carver_block_size=2097152
--disable_extensions=true
--disable_events=true
I tried playing with several of these options, but without results. After enabling verbose logging, I don’t see any unusual messages in the logs around the time the host stops responding, except for this line:
distributed.cpp:248] Removing expired running distributed query: cb8cee4e4232a54914034659b3b073d60d28a4c129b03eebc9dd536debcdec79
I can share osquery logs covering hours or days via DM if you need them. I would be very grateful for any help in pinpointing and eliminating the problem!
From the logs, I also noticed that at the moment osquery stops responding to ad-hoc requests, it also stops sending requests to the /api/osquery/distributed/read and /api/osquery/distributed/write endpoints.
A trimmed version of the logs: the last successful fetch was around May 19 16:02, and the problems start around May 19 17:02.
s
As a caveat, this is the osquery slack, and while FleetDM is a common bit of software, it is very far from universal. Some of your comments blur the two a bit. That said, I suspect this is on the osquery side, and not the #fleet side.
As I understand the internals, osquery uses different threads for the distributed queries and the scheduled queries. I can imagine one thread wedging, and the other one still going. I think I’ve seen that kind of behavior, but I have not dug into it.
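As a rough illustration of that hypothesis (a toy model only, not osquery’s actual implementation; every name here is made up), two independent worker loops where the distributed one wedges while the scheduler keeps going would look exactly like the reported symptoms from the server’s side: the host keeps checking in and logging scheduled results, but live queries never come back.

#include <chrono>
#include <iostream>
#include <thread>

int main() {
  // "Scheduler" loop: keeps running scheduled queries on its own thread.
  std::thread scheduler([] {
    for (int i = 0; i < 5; ++i) {
      std::cout << "scheduler: ran scheduled queries\n";
      std::this_thread::sleep_for(std::chrono::seconds(1));
    }
  });

  // "Distributed" loop: picks up one live query and then wedges,
  // e.g. on a table generator that never returns.
  std::thread distributed([] {
    std::cout << "distributed: running a live query...\n";
    std::this_thread::sleep_for(std::chrono::hours(24)); // simulated hang
    std::cout << "distributed: done (never printed)\n";
  });

  scheduler.join();
  std::cout << "scheduler still fine; distributed thread is wedged\n";
  distributed.detach(); // let the toy program exit; a real agent would keep hanging
  return 0;
}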
That log snippet has a lot of “Found 1 distributed queries marked as denylisted” entries. Is that the cause here?
From that log, it looks like the last distributed query it attempted to run was
SELECT de.encrypted, m.path FROM disk_encryption de JOIN mounts m ON m.device_alias = de.name;
Is that consistent across the various times it failed?
r
This strikes me as familiar. We had a problem where a JOIN with the disk_encryption table was taking 3 minutes to run. The issue was that some entries in block_devices were very slow to read in GCP (maybe waiting on network timeouts). I resolved it by adding some logic to the block_devices generator table to look at the requested "name" value in the QueryContext and directly retrieve the specific udev entry for that name. That brought the query time down to about 0.1 sec.
I also updated disk_encryption to only request the specific block device when it had a name, and to pass that down into the block_devices subquery.
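Roughly, the shape of that fix looks like this (a minimal sketch of the pattern, not the actual patch; the header path varies between osquery versions, and the two helper functions are hypothetical stand-ins for the real udev/enumeration code):

#include <string>

#include <osquery/core/tables.h> // older releases use <osquery/tables.h>

namespace osquery {
namespace tables {

// Hypothetical stand-ins for the real udev/enumeration helpers.
void populateRowForDevice(const std::string& name, QueryData& results);
void enumerateAllBlockDevices(QueryData& results);

QueryData genBlockDevices(QueryContext& context) {
  QueryData results;

  // If the query (for example a JOIN coming from disk_encryption)
  // constrained "name", read just those specific udev entries.
  if (context.hasConstraint("name", EQUALS)) {
    for (const auto& name : context.constraints["name"].getAll(EQUALS)) {
      populateRowForDevice(name, results);
    }
    return results;
  }

  // No constraint: fall back to the full (slow) enumeration.
  enumerateAllBlockDevices(results);
  return results;
}

} // namespace tables
} // namespace osquery

The unconstrained branch keeps plain SELECT * behavior intact; only constrained lookups take the fast path.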
s
I know there are some bugs in the block_devices and disk_encryption tables around handling query context and joins. It sounds like maybe you fixed them?
r
I mean I made it O(N) instead of O(N^2) where N was big and slow.
s
Yeah, those.
r
Let me see if I can port my change onto open source easily and open a PR.
s
May or may not be the OP’s only problem
r
Sure... just sounded familiar so hope it helps.
Oh, hang on
lol
You wrote half of this patch
You wrote the half in disk_encryption... I guess I just carried the idea over to the block_devices table.
s
Yeah… though I think my disk_encryption change is actually partly buggy. A coworker of mine is in the process of reverting parts of that.
r
Hmm, ok. Good to know.
s
It’s not so much buggy as not doing enough: it doesn’t traverse through parent encryption on LVM volumes.
a
Hello! I will try to investigate further in the next few days and give you more context. Thank you for your ideas!