# general
h
I've got three homegrown extensions deployed to a big fleet. They're used in some scheduled queries, and initially they all load (each logs a
Registering extension
line in the osqueryd.INFO file). Then, less than 24h later, about 1,000 hosts in the fleet start reporting "Error executing <pack>: no such table: <table>". An osqueryd restart fixes it. Ideas?
Also, even while that error is being emitted, I can go into osqueryi and query the table fine every time.
s
Unless you're using
--connect
osqueryi is a separate process and will spawn its own extensions.
👍 1
Sounds like the extension is dying (or being killed). You could use the process table to examine it
h
@seph, isn't there some sort of oversight on the extension process from osqueryd? Presuming there's still a process alive for the extension, is there anything I could/should look for in
ps auxww
output for it?
s
extensions run as their own process. You should be able to see them in the ps output.
Is it alive? Is the memory utilization okay? I don’t have any idea about your extension, so this is first principles…
I don’t offhand remember how osquery handles poorly performing extensions. Crashing is likely different than hanging.
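A minimal sketch of that first-principles check: parse ps output and pull out the extension processes to eyeball liveness and memory. The ".ext" marker, field order, and sample values below are illustrative assumptions, not anything from your environment.

```python
def parse_extensions(ps_text, marker=".ext"):
    """Return (pid, ppid, rss_kb, cmdline) tuples for lines mentioning `marker`.

    ps_text is expected to be the output of: ps -eo pid=,ppid=,rss=,args=
    """
    rows = []
    for line in ps_text.strip().splitlines():
        pid, ppid, rss, cmd = line.split(None, 3)
        if marker in cmd:
            rows.append((int(pid), int(ppid), int(rss), cmd))
    return rows

# Hypothetical snapshot (made-up PIDs/RSS), mirroring the layout of `ps -eo pid=,ppid=,rss=,args=`:
sample = """\
  812     1  41200 /opt/osquery/bin/osqueryd --flagfile /etc/osquery/osquery.flags
  813   812  18900 python3.8 /usr/lib/osquery/extension1.ext --socket /var/osquery/osquery.em
  814   812  19100 python3.8 /usr/lib/osquery/extension2.ext --socket /var/osquery/osquery.em
"""
for pid, ppid, rss, cmd in parse_extensions(sample):
    print(pid, ppid, f"{rss} kB", cmd.split()[1])
```

In real use you'd feed it `subprocess.run(["ps", "-eo", "pid=,ppid=,rss=,args="], ...)` output instead of the sample string.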
m
if this is Windows, we just recently fixed a problem with extension loading https://github.com/osquery/osquery/issues/7324
h
It's Linux, but will comb through some ones that are erring today, through the process table, and will see if I can distinguish bad from good using that. Thanks, both.
@seph, I think we're experiencing a bug in how osquery recovers from watchdog errors when there are multiple extensions.
The extension-handling bug is brought on by a watchdog action. Again, we have three extensions we've developed, each a Python-based extension through Swift. This is the process tree, ORDERED CHRONOLOGICALLY:
/opt/osquery/bin/osqueryd --flagfile /etc/osquery/osquery.flags --config_path /etc/osquery/osquery.conf
 \_ /opt/osquery/bin/osqueryd
 \_ .../bin/python3.8 /usr/lib/osquery/extension1.ext --socket /var/osquery/osquery.em --timeout 3 --interval 3
 \_ .../bin/python3.8 /usr/lib/osquery/extension2.ext --socket /var/osquery/osquery.em --timeout 3 --interval 3
 \_ .../bin/python3.8 /usr/lib/osquery/extension3.ext --socket /var/osquery/osquery.em --timeout 3 --interval 3
This is how the processes look in 'ps auxwf' when freshly started and all the extensions are performant. When some query (not an extension) trips the watchdog in our environment, the child worker process (line 2 above) dies and gets restarted. In our setup, that restart doesn't bring back all the extension processes, and the tree ends up looking like this, ORDERED CHRONOLOGICALLY:
/opt/osquery/bin/osqueryd --flagfile /etc/osquery/osquery.flags --config_path /etc/osquery/osquery.conf
 \_ .../bin/python3.8 /usr/lib/osquery/extension1.ext --socket /var/osquery/osquery.em --timeout 3 --interval 3
 \_ .../bin/python3.8 /usr/lib/osquery/extension2.ext --socket /var/osquery/osquery.em --timeout 3 --interval 3
 \_ /opt/osquery/bin/osqueryd
 \_ .../bin/python3.8 /usr/lib/osquery/extension3.ext --socket /var/osquery/osquery.em --timeout 3 --interval 3
At this point, extension 3 works, and its start time matches the child daemon above it, but extensions 1 & 2 still have start times matching the parent process above them. So in a watchdog situation, it appears something hasn't managed to iterate through the extensions and restart them all. If I look up the pid of an erring extension and kill it, something restarts it immediately, and it resumes working.
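That failure signature can be checked mechanically: after a watchdog restart, any extension process whose parent is not the current osqueryd worker is left over from before the restart. A hedged sketch (PIDs and the ".ext" marker below are made up for illustration):

```python
def stale_extensions(procs, worker_pid):
    """procs: iterable of (pid, ppid, cmdline) tuples, e.g. from `ps -eo pid,ppid,args`.

    Returns pids of extension processes not parented to the current worker,
    i.e. survivors from before a watchdog restart."""
    return [
        pid
        for pid, ppid, cmdline in procs
        if ".ext" in cmdline and ppid != worker_pid
    ]

# Tree after the watchdog killed the old worker and spawned a new one (PID 200);
# extensions 1 and 2 were never restarted and hang off the top-level launcher.
snapshot = [
    (1,   0,   "/opt/osquery/bin/osqueryd --flagfile /etc/osquery/osquery.flags"),
    (101, 1,   "python3.8 /usr/lib/osquery/extension1.ext --socket /var/osquery/osquery.em"),
    (102, 1,   "python3.8 /usr/lib/osquery/extension2.ext --socket /var/osquery/osquery.em"),
    (200, 1,   "/opt/osquery/bin/osqueryd"),
    (201, 200, "python3.8 /usr/lib/osquery/extension3.ext --socket /var/osquery/osquery.em"),
]

print(stale_extensions(snapshot, worker_pid=200))  # → [101, 102]
```

Killing the pids this returns (and letting osquery respawn them) matches the manual fix described above.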
Ah, we're running Thrift 0.13.0, and we're going to try upgrading a box to Thrift 0.15.0; we'll know by Monday, I suppose.
m
We're working on updating Thrift within osquery too. https://github.com/osquery/osquery/pull/7330/commits/ad4a128fb0eaaf811f80e2ee6a6243454da0ec2d Maybe you can build osquery from this branch to test if it resolves the issue you're seeing?
Or one of us can provide a test build from that branch.