# fleet-dev
r
Random one for you - I tried this as a distributed query and it exceeded the memory limits of osqueryd:
osqueryd worker (18095) stopping: Memory limits exceeded: 336852000
furthermore, the kernel OOM killer got involved and started killing off important services such as Elasticsearch 😬
seems to have also generated a huge amount of load on the hosts
there was a huge amount of iowait, and memory usage shot up so quickly that our telemetry got killed before we could see what was happening, and the OOM killer started killing everything
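For context, the ceiling in that log line comes from osquery's own watchdog flags; a minimal sketch of the relevant options (the flagfile path and values here are illustrative, not the actual config on these hosts):
# --watchdog_level: 0 = default profile, 1 = restrictive, -1 = disable the watchdog
# --watchdog_memory_limit: worker memory ceiling in MB before the watchdog kills/restarts it
# --watchdog_utilization_limit: CPU utilization ceiling for the worker
osqueryd --flagfile=/etc/osquery/osquery.flags --watchdog_level=0 --watchdog_memory_limit=300 --watchdog_utilization_limit=60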
Just realised this won’t be pinging anyone because it’s on an automated PR message, so CCing @zwass and @Tomas Touceda as you’ve replied to me in other topics, in case you have some advice? 🙂
t
hi there, TIL about #fleet-dev xD is that CPU usage for fleet? or the host running the query?
r
that was on one of the hosts running the query yeah 🙂
t
yeah, that's the tricky thing with osquery... what CPU/RAM did this host have?
r
I ran the distributed query over ~1400 servers, and as the results came in I realised it was showing every instance of Log4j across the estate, even the patched ones, so I think I just closed the tab. Some questions:
1. If you close the Fleet tab, will the distributed query continue running in the background? If so, can you connect back to it in a new browser tab to see the results, or cancel it?
2. If the osquery watchdog kills the osquery worker, does Fleet try to issue the same query again when the worker restarts, or does it mark that as failed?
We had 40 hosts which reacted badly to this query and the OOM killer killed the main service along with things like telegraf (telemetry) and SSH itself.
I can reproduce this - just tried it targeting a single host while running the command
watch -n1 free -m
I can see that around 100MB of RAM is consumed per second, until the system runs out of RAM and everything gets killed
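A possible refinement on that repro: watch the worker process's memory directly rather than system-wide free memory (assumes the process is named osqueryd):
# Show the osqueryd processes sorted by resident memory, refreshed every second
watch -n1 'ps -C osqueryd -o pid,rss,etime,args --sort=-rss'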
z
😰 It can be a relatively expensive query if there are a lot of JARs running on the system. That seems like more than I'd expect though.
r
yeah 😄
z
If you close the Fleet tab will the distributed query continue running in the background?
No, it will stop running.
r
I’m also confused that the osquery watchdog allowed it to consume so much memory there
z
If the osquery watchdog kills the osquery worker, does Fleet try to issue the same query again when the worker restarts, or does it mark that as failed?
Yes. osquery doesn't (currently) expose any mechanism for indicating that the watchdog killed a distributed query. @sharvil is going to be looking into improving the performance monitoring for live queries within osquery, maybe we can also look at detecting watchdog kills of queries?
r
Ok cool, all good to understand 🙂
z
I’m also confused that the osquery watchdog allowed it to consume so much memory there
IIRC the watchdog checks the utilization on an interval so there may be a brief lag before it enforces those limits. For the highest level of production safety you may want to configure a max memory limit via cgroups.
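A sketch of what that cgroup cap could look like with systemd (the osqueryd.service unit name, flagfile path and the 500M figure are assumptions, not a recommendation for these hosts):
# Persistent hard cap on the osqueryd unit via the cgroup memory controller
sudo systemctl set-property osqueryd.service MemoryMax=500M
# Or a one-off capped test run in a transient scope
sudo systemd-run --scope -p MemoryMax=500M osqueryd --flagfile=/etc/osquery/osquery.flags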
r
yeah good idea 🙂
is there a way to do an ANALYZE on this query? I’d love to figure out what it is unhappy about
on its own, the first query runs absolutely fine
just gives a big list of jar files
z
Approx how big is that list? The second query will then run a YARA scan on every one of those files.
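For anyone following along, the second stage queries osquery's yara table against each of those paths; a rough sketch of the shape (the signature file path and the JAR-listing subquery are placeholders, not the actual query from this thread):
osqueryi "SELECT path, matches, count FROM yara
          WHERE sigfile = '/etc/osquery/yara/log4j.yar'
            AND path IN (SELECT path FROM file WHERE directory = '/opt/app/lib');"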
r
261 jar files
bit exciting for january 5th 😄
Is it worth putting some sort of disclaimer on that query as being potentially heavy/disruptive to run do you think?