# general
h
I've got three homegrown extensions deployed to a big fleet. They're used in some scheduled queries, and initially they all load (each logs a
Registering extension
line in the osqueryd.INFO file). Then, less than 24h later, about 1,000 hosts in the fleet start reporting "Error executing <pack>: no such table: <table>". An osqueryd restart fixes it. Ideas?
Also, even while that error is being emitted, I can go into osqueryi and query the table fine every time.
s
Unless you're using
--connect
osqueryi is a separate process and will spawn its own extensions.
👍 1
Sounds like the extension is dying (or being killed). You could use the process table to examine it
h
@seph, isn't there some sort of oversight on the extension process from osqueryd? Presuming there's still a process alive for the extension, is there anything I could/should look for in
ps auxww
output for it?
s
extensions run as their own process. You should be able to see them in the ps output.
Is it alive? Is the memory utilization okay? I don’t have any idea about your extension, so this is first principles…
I don’t offhand remember how osquery handles poorly performing extensions. Crashing is likely different than hanging.
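A minimal sketch of that first-principles check: parse ps output and pull out the extension processes to eyeball liveness and memory. The ".ext" marker, field order, and sample values below are illustrative assumptions, not anything from your environment.

```python
def parse_extensions(ps_text, marker=".ext"):
    """Return (pid, ppid, rss_kb, cmdline) tuples for lines mentioning `marker`.

    ps_text is expected to be the output of: ps -eo pid=,ppid=,rss=,args=
    """
    rows = []
    for line in ps_text.strip().splitlines():
        pid, ppid, rss, cmd = line.split(None, 3)
        if marker in cmd:
            rows.append((int(pid), int(ppid), int(rss), cmd))
    return rows

# Hypothetical snapshot (made-up PIDs/RSS), mirroring the layout of `ps -eo pid=,ppid=,rss=,args=`:
sample = """\
  812     1  41200 /opt/osquery/bin/osqueryd --flagfile /etc/osquery/osquery.flags
  813   812  18900 python3.8 /usr/lib/osquery/extension1.ext --socket /var/osquery/osquery.em
  814   812  19100 python3.8 /usr/lib/osquery/extension2.ext --socket /var/osquery/osquery.em
"""
for pid, ppid, rss, cmd in parse_extensions(sample):
    print(pid, ppid, f"{rss} kB", cmd.split()[1])
```

In real use you'd feed it `subprocess.run(["ps", "-eo", "pid=,ppid=,rss=,args="], ...)` output instead of the sample string.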
m
if this is Windows, we just recently fixed a problem with extension loading https://github.com/osquery/osquery/issues/7324
h
It's Linux, but will comb through some ones that are erring today, through the process table, and will see if I can distinguish bad from good using that. Thanks, both.
@seph, I think we're experiencing a bug in how osquery recovers from watchdog errors when there are multiple extensions.
The extension-handling bug is brought on by a watchdog action. Again, we have three extensions we've developed, each a Python-based extension through Swift. This is the process tree, ORDERED CHRONOLOGICALLY:
/opt/osquery/bin/osqueryd --flagfile /etc/osquery/osquery.flags --config_path /etc/osquery/osquery.conf
 \_ /opt/osquery/bin/osqueryd
 \_ .../bin/python3.8 /usr/lib/osquery/extension1.ext --socket /var/osquery/osquery.em --timeout 3 --interval 3
 \_ .../bin/python3.8 /usr/lib/osquery/extension2.ext --socket /var/osquery/osquery.em --timeout 3 --interval 3
 \_ .../bin/python3.8 /usr/lib/osquery/extension3.ext --socket /var/osquery/osquery.em --timeout 3 --interval 3
This is how the processes look in 'ps auxwf' when freshly started and all the extensions are performant. When some query (not an extension) trips the watchdog in our environment, the child worker process (line 2 above) dies and gets restarted. In our setup, that restart doesn't bring back all the extension processes, and the tree ends up looking like this, ORDERED CHRONOLOGICALLY:
/opt/osquery/bin/osqueryd --flagfile /etc/osquery/osquery.flags --config_path /etc/osquery/osquery.conf
 \_ .../bin/python3.8 /usr/lib/osquery/extension1.ext --socket /var/osquery/osquery.em --timeout 3 --interval 3
 \_ .../bin/python3.8 /usr/lib/osquery/extension2.ext --socket /var/osquery/osquery.em --timeout 3 --interval 3
 \_ /opt/osquery/bin/osqueryd
 \_ .../bin/python3.8 /usr/lib/osquery/extension3.ext --socket /var/osquery/osquery.em --timeout 3 --interval 3
At this point, extension 3 works, and its start time matches the child daemon above it, but extensions 1 & 2 still have start times matching the parent process above them. So in a watchdog situation, it appears something hasn't managed to iterate through the extensions and restart them all. If I look up the pid of an erring extension and kill it, something restarts it immediately, and it resumes working.
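That failure signature can be checked mechanically: after a watchdog restart, any extension process whose parent is not the current osqueryd worker is left over from before the restart. A hedged sketch (PIDs and the ".ext" marker below are made up for illustration):

```python
def stale_extensions(procs, worker_pid):
    """procs: iterable of (pid, ppid, cmdline) tuples, e.g. from `ps -eo pid,ppid,args`.

    Returns pids of extension processes not parented to the current worker,
    i.e. survivors from before a watchdog restart."""
    return [
        pid
        for pid, ppid, cmdline in procs
        if ".ext" in cmdline and ppid != worker_pid
    ]

# Tree after the watchdog killed the old worker and spawned a new one (PID 200);
# extensions 1 and 2 were never restarted and hang off the top-level launcher.
snapshot = [
    (1,   0,   "/opt/osquery/bin/osqueryd --flagfile /etc/osquery/osquery.flags"),
    (101, 1,   "python3.8 /usr/lib/osquery/extension1.ext --socket /var/osquery/osquery.em"),
    (102, 1,   "python3.8 /usr/lib/osquery/extension2.ext --socket /var/osquery/osquery.em"),
    (200, 1,   "/opt/osquery/bin/osqueryd"),
    (201, 200, "python3.8 /usr/lib/osquery/extension3.ext --socket /var/osquery/osquery.em"),
]

print(stale_extensions(snapshot, worker_pid=200))  # → [101, 102]
```

Killing the pids this returns (and letting osquery respawn them) matches the manual fix described above.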
Ah, we're running Thrift 0.13.0, and we're going to try upgrading a box to Thrift 0.15.0; we'll know by Monday, I suppose.
m
We're working on updating Thrift within osquery too. https://github.com/osquery/osquery/pull/7330/commits/ad4a128fb0eaaf811f80e2ee6a6243454da0ec2d Maybe you can build osquery from this branch to test if it resolves the issue you're seeing?
Or one of us can provide a test build from that branch.