# fleet-dev
r
Random one for you - I tried this as a distributed query and it exceeded the memory limits of osqueryd:
osqueryd worker (18095) stopping: Memory limits exceeded: 336852000
furthermore, the kernel OOM killer got involved and started killing off important services such as Elasticsearch 😬
seems to have also generated a huge amount of load on the hosts
there was a huge amount of iowait, and memory usage shot up so quickly that our telemetry got killed before we could see what was happening, and the OOM killer started killing everything
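For context, the ceiling in that log line comes from osquery's own watchdog flags; a minimal sketch of the relevant options (the flagfile path and values here are illustrative, not the actual config on these hosts):
# --watchdog_level: 0 = default profile, 1 = restrictive, -1 = disable the watchdog
# --watchdog_memory_limit: worker memory ceiling in MB before the watchdog kills/restarts it
# --watchdog_utilization_limit: CPU utilization ceiling for the worker
osqueryd --flagfile=/etc/osquery/osquery.flags --watchdog_level=0 --watchdog_memory_limit=300 --watchdog_utilization_limit=60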
Just realised this won’t be pinging anyone because it’s on an automated PR message, so CCing @zwass and @Tomas Touceda as you’ve replied to me in other topics, in case you have some advice? 🙂
t
hi there, TIL about #fleet-dev xD is that CPU usage for fleet? or the host running the query?
r
that was on one of the hosts running the query yeah 🙂
t
yeah, that's the tricky thing with osquery... what CPU/RAM did this host have?
r
I ran the distributed query over ~1400 servers, and as the results came in I realised it was showing every instance of Log4j across the estate, even the patched ones, so I think I just closed the tab. Some questions:
1. If you close the Fleet tab, will the distributed query continue running in the background? If so, can you connect back to it in a new browser tab to see the results, or cancel it?
2. If the osquery watchdog kills the osquery worker, does Fleet try to issue the same query again when the worker restarts, or does it mark that as failed?
We had 40 hosts which reacted badly to this query and the OOM killer killed the main service along with things like telegraf (telemetry) and SSH itself.
I can reproduce this - just tried it targeting a single host while running the command
watch -n1 free -m
I can see that around 100MB of RAM is consumed per second, until the system runs out of RAM and everything gets killed
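A possible refinement on that repro: watch the worker process's memory directly rather than system-wide free memory (assumes the process is named osqueryd):
# Show the osqueryd processes sorted by resident memory, refreshed every second
watch -n1 'ps -C osqueryd -o pid,rss,etime,args --sort=-rss'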
z
😰 It can be a relatively expensive query if there are a lot of JARs running on the system. That seems like more than I'd expect though.
r
yeah 😄
z
If you close the Fleet tab will the distributed query continue running in the background?
No, it will stop running.
r
I’m also confused that the osquery watchdog allowed it to consume so much memory there
z
If the osquery watchdog kills the osquery worker, does Fleet try to issue the same query again when the worker restarts, or does it mark that as failed?
Yes. osquery doesn't (currently) expose any mechanism for indicating that the watchdog killed a distributed query. @sharvil is going to be looking into improving the performance monitoring for live queries within osquery, maybe we can also look at detecting watchdog kills of queries?
r
Ok cool, all good to understand 🙂
z
I’m also confused that the osquery watchdog allowed it to consume so much memory there
IIRC the watchdog checks the utilization on an interval so there may be a brief lag before it enforces those limits. For the highest level of production safety you may want to configure a max memory limit via cgroups.
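A sketch of what that cgroup cap could look like with systemd (the osqueryd.service unit name, flagfile path and the 500M figure are assumptions, not a recommendation for these hosts):
# Persistent hard cap on the osqueryd unit via the cgroup memory controller
sudo systemctl set-property osqueryd.service MemoryMax=500M
# Or a one-off capped test run in a transient scope
sudo systemd-run --scope -p MemoryMax=500M osqueryd --flagfile=/etc/osquery/osquery.flags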
r
yeah good idea 🙂
is there a way to do an ANALYZE on this query? I’d love to figure out what it is unhappy about
on its own, the first query runs absolutely fine
just gives a big list of jar files
z
Approx how big is that list? The second query will then run a YARA scan on every one of those files.
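For anyone following along, the second stage queries osquery's yara table against each of those paths; a rough sketch of the shape (the signature file path and the JAR-listing subquery are placeholders, not the actual query from this thread):
osqueryi "SELECT path, matches, count FROM yara
          WHERE sigfile = '/etc/osquery/yara/log4j.yar'
            AND path IN (SELECT path FROM file WHERE directory = '/opt/app/lib');"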
r
261 jar files
bit exciting for january 5th 😄
Is it worth putting some sort of disclaimer on that query as being potentially heavy/disruptive to run do you think?