# general
t
@jaredl, that's correct, there was no "watchdog_utilization_limit" originally, it was just a "level" where the default was normal
j
@theopolis - interesting, so with 2.10, if I set that to `100`, what does that mean in terms of how long the worker can consume X% of the CPU before being killed?
t
Good question. It’s a bit obtuse, I’ll apologize for that up front. But it’s the count of allowed CPU cycles between 3s intervals. 3s is what most programs use to calculate utilization, so we adopted that too.
I’d have to look and see what 100 is exactly, but iirc it’s about 65% of a core
j
Interesting, so, would that mean 65% of a core for 12 seconds before watchdog kills the worker? I’m looking at https://github.com/facebook/osquery/blob/2.10.0/osquery/core/watcher.cpp#L333-L346 but this logic has me horribly confused
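To make the question concrete, here is a minimal sketch of a "sustained over the limit" check. It assumes the 3s interval and the 12 second window mentioned above; the class, field names, and the exact comparison are hypothetical and are not taken from `watcher.cpp`:

```python
from dataclasses import dataclass


@dataclass
class WatchdogSketch:
    """Hypothetical model of a sustained-utilization check (not osquery's code)."""

    utilization_limit: int     # allowed CPU count per check interval (assumed unit)
    interval_s: int = 3        # check interval, per the thread
    latency_limit_s: int = 12  # sustained window, per the question above (assumed)
    sustained: int = 0         # consecutive intervals spent over the limit

    def observe(self, cpu_delta: int) -> bool:
        """Record one interval's CPU delta; return True if the worker would be killed."""
        if cpu_delta > self.utilization_limit:
            self.sustained += 1
        else:
            # Any interval under the limit resets the streak.
            self.sustained = 0
        return self.sustained > 0 and self.sustained * self.interval_s >= self.latency_limit_s


watchdog = WatchdogSketch(utilization_limit=100)
verdicts = [watchdog.observe(150) for _ in range(4)]
print(verdicts)
```

Under these assumptions, a worker that stays over the limit for four consecutive checks is killed on the fourth, while a single quiet interval resets the count.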
t
yes, CPU utilization is a confusing topic
what are you trying to achieve?
maybe we can alter the code to fit your use case
j
Currently, `osqueryd` is having to deal with thousands of `connect` and `execve` syscalls for the process/socket_event tag auditing. However, watchdog keeps killing the worker since it spikes to 100% CPU usage for a while (the events come in waves). So, I’m trying to give it some breathing room to actually get through the audit events.
It’s not a great fix, since it still causes the CPU spikes, but it’s better than the worker constantly spinning up, taking 100%, dying, and that process continuing forever
Maybe there’s a better way to pull off what I’m trying to accomplish. I tried setting the rate limit via `auditctl -r` but it doesn’t seem to actually have an effect from what I can see.
The other idea I just had would be to use cgroups to restrict the CPU usage allowed by osquery. Although, I think the ‘root’ of the issue is the kernel audit rules attempting to get all of the `connect`/`bind`/`execve` syscall events.
t
@clong and @alessandrogario may have ideas for audit specifically
but we at FB use cgroups for osquery
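For reference, a cgroup CPU cap is expressed as a quota of CPU time per scheduling period. A minimal sketch, assuming cgroup v2's `cpu.max` format (`$MAX $PERIOD`, both in microseconds); the helper names are hypothetical and this is not a description of FB's actual setup:

```python
from typing import Tuple


def cfs_quota(core_fraction: float, period_us: int = 100_000) -> Tuple[int, int]:
    """Return (quota_us, period_us) that caps a group at `core_fraction` of one core."""
    if core_fraction <= 0:
        raise ValueError("core_fraction must be positive")
    return int(round(core_fraction * period_us)), period_us


def cpu_max_line(core_fraction: float, period_us: int = 100_000) -> str:
    """Format the cap the way cgroup v2's cpu.max file expects it."""
    quota_us, period = cfs_quota(core_fraction, period_us)
    return f"{quota_us} {period}"


# Cap at roughly 65% of one core, the figure mentioned earlier in the thread.
print(cpu_max_line(0.65))
```

On cgroup v1 the same two numbers would go into `cpu.cfs_quota_us` and `cpu.cfs_period_us` instead.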
j
Yeah, I’m actually working with @clong to sort through some of the performance issues we’re seeing w/ audit & osquery.
@theopolis - Do you disable watchdog and rely on cgroups to restrict it entirely, or use them both?
t
I use both
If you’re testing audit, use the 3.0.0 flag
Er, tag
j
This is actually our production environment w/ 2.10 although I’m looking forward to the audit rewrite work as well