# general
a
Hi folks! I’m not sure this is the right channel for my question, but I’ve been fighting with some kind of bug in osquery (or a not-so-obvious misconfiguration) for a long time and I need your help. I know that @zwass has a really deep understanding of osquery’s under-the-hood mechanisms. On some Linux hosts in our server infrastructure I see problems with fetching host details and responding to ad-hoc queries. At the same time, distributed queries (already scheduled and new ones) return results without any problems. After installing the osquery agent from scratch (which includes removing the old RocksDB), the host responds to all queries for several hours (maybe a day), but at some unpredictable moment it just stops responding to ad-hoc queries. After that you can’t gather any information from this host through the FleetDM UI (we use version 4.31.0): neither host details nor ad-hoc queries. But such a host stays online in the FleetDM panel and keeps responding to distributed queries. Only a full reinstall of osquery, including removing /var/osquery/osquery.db, helps, and only temporarily. I see this problem only on a few specific hosts, and right now I can’t determine the root cause. These hosts have the same OS as the others, and osquery has the same version and configuration (managed by Ansible). You can see the osquery.flags file here:
--enroll_secret_path=<secret path>
--tls_hostname=<endpoint>:443
--host_identifier=hostname
--enroll_tls_endpoint=/api/osquery/enroll
--config_plugin=tls
--config_tls_endpoint=/api/osquery/config
--config_refresh=60
--disable_distributed=false
--distributed_plugin=tls
--distributed_interval=30
--distributed_tls_max_attempts=5
--distributed_tls_read_endpoint=/api/osquery/distributed/read
--distributed_tls_write_endpoint=/api/osquery/distributed/write
--logger_plugin=filesystem,tls
--logger_tls_endpoint=/api/osquery/log
--logger_tls_period=10
--logger_tls_max_linesize=20971520
--read_max=209715200
--table_delay=200
--disable_carver=false
--carver_disable_function=false
--carver_compression=true
--carver_start_endpoint=/api/osquery/carve/begin
--carver_continue_endpoint=/api/osquery/carve/block
--carver_block_size=2097152
--disable_extensions=true
--disable_events=true
I tried playing with several of these options, but without results. After enabling verbose logging, I don’t see any unusual messages in the logs around the time the host stops responding, except for this line:
distributed.cpp:248] Removing expired running distributed query: cb8cee4e4232a54914034659b3b073d60d28a4c129b03eebc9dd536debcdec79
I can share osquery logs covering hours or days via DM if you need them. I would be very grateful for any help in pinpointing and eliminating the problem!
From the logs, I also noticed that at the moment osquery stops responding to ad-hoc requests, it also stops sending requests to the /api/osquery/distributed/read and /api/osquery/distributed/write endpoints.
A trimmed version of the logs: the last successful fetch was around May 19 16:02, and the problems start around May 19 17:02.
s
As a caveat, this is the osquery slack, and while FleetDM is a common bit of software, it is very far from universal. Some of your comments blur the two a bit. That said, I suspect this is on the osquery side, and not the #fleet side.
As I understand the internals, osquery uses different threads for the distributed queries and the scheduled queries. I can imagine one thread wedging, and the other one still going. I think I’ve seen that kind of behavior, but I have not dug into it.
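As a rough illustration of that hypothesis (a toy model only, not osquery’s actual implementation; every name here is made up), two independent worker loops where the distributed one wedges while the scheduler keeps going would look exactly like the reported symptoms from the server’s side: the host keeps checking in and logging scheduled results, but live queries never come back.

#include <chrono>
#include <iostream>
#include <thread>

int main() {
  // "Scheduler" loop: keeps running scheduled queries on its own thread.
  std::thread scheduler([] {
    for (int i = 0; i < 5; ++i) {
      std::cout << "scheduler: ran scheduled queries\n";
      std::this_thread::sleep_for(std::chrono::seconds(1));
    }
  });

  // "Distributed" loop: picks up one live query and then wedges,
  // e.g. on a table generator that never returns.
  std::thread distributed([] {
    std::cout << "distributed: running a live query...\n";
    std::this_thread::sleep_for(std::chrono::hours(24)); // simulated hang
    std::cout << "distributed: done (never printed)\n";
  });

  scheduler.join();
  std::cout << "scheduler still fine; distributed thread is wedged\n";
  distributed.detach(); // let the toy program exit; a real agent would keep hanging
  return 0;
}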
That log snippet has a lot of “Found 1 distributed queries marked as denylisted” entries. Is that the cause here?
From that log, it looks like the last distributed query it attempted to run was
SELECT de.encrypted, m.path FROM disk_encryption de JOIN mounts m ON m.device_alias = de.name;
Is that consistent across the various times it failed?
r
This strikes me as familiar. We had a problem where a JOIN with the disk_encryption table was taking 3 minutes to run. The issue was that some entries in block_devices were very slow to read in GCP (maybe waiting on network timeouts). I resolved it by adding some logic to the block_devices generator table to look at the requested "name" value in the QueryContext and directly retrieve the specific udev entry for that name. That brought the query time down to about 0.1 sec.
I also updated disk_encryption to only request the specific block device when it had a name, and to pass that down into the block_devices subquery.
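Roughly, the shape of that fix looks like this (a minimal sketch of the pattern, not the actual patch; the header path varies between osquery versions, and the two helper functions are hypothetical stand-ins for the real udev/enumeration code):

#include <string>

#include <osquery/core/tables.h> // older releases use <osquery/tables.h>

namespace osquery {
namespace tables {

// Hypothetical stand-ins for the real udev/enumeration helpers.
void populateRowForDevice(const std::string& name, QueryData& results);
void enumerateAllBlockDevices(QueryData& results);

QueryData genBlockDevices(QueryContext& context) {
  QueryData results;

  // If the query (for example a JOIN coming from disk_encryption)
  // constrained "name", read just those specific udev entries.
  if (context.hasConstraint("name", EQUALS)) {
    for (const auto& name : context.constraints["name"].getAll(EQUALS)) {
      populateRowForDevice(name, results);
    }
    return results;
  }

  // No constraint: fall back to the full (slow) enumeration.
  enumerateAllBlockDevices(results);
  return results;
}

} // namespace tables
} // namespace osquery

The unconstrained branch keeps plain SELECT * behavior intact; only constrained lookups take the fast path.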
s
I know there are some bugs in the block_devices and disk_encryption tables around handling query context and joins. It sounds like maybe you fixed them?
r
I mean I made it O(N) instead of O(N^2) where N was big and slow.
s
Yeah, those.
r
Let me see if I can port my change onto open source easily and open a PR.
s
May or may not be the OP’s only problem
r
Sure... just sounded familiar so hope it helps.
Oh, hang on
lol
You wrote half of this patch
You wrote the half in disk_encryption... I guess I just carried the idea over to the block_devices table.
s
Yeah… though I think my disk_encryption change is actually partly buggy. A coworker of mine is in the process of reverting parts of that.
r
Hmm, ok. Good to know.
s
It’s not so much buggy as not doing enough: it doesn’t traverse through parent encryption on LVM volumes.
a
Hello! I will try to investigate further in the next few days and give you more context. Thank you for your ideas!