#general
theopolis

02/02/2018, 3:18 AM
@jaredl, that's correct, there was no "watchdog_utilization_limit" originally, it was just a "level" where the default was normal
jaredl

02/02/2018, 12:50 PM
@theopolis - interesting, so with 2.10, if I set that to 100, what does that mean in terms of how long the worker can consume X% of the CPU before being killed?
theopolis

02/02/2018, 5:18 PM
Good question. It’s a bit obtuse, and I’ll apologize for that up front. It’s the count of allowed CPU cycles between 3s intervals. 3s is what most programs use to calculate utilization, so we adopted that too.
5:19 PM
I’d have to look and see what 100 is exactly, but iirc it’s about 65% of a core
jaredl

02/02/2018, 5:25 PM
Interesting, so, would that mean 65% of a core for 12 seconds before watchdog kills the worker? I’m looking at https://github.com/facebook/osquery/blob/2.10.0/osquery/core/watcher.cpp#L333-L346 but this logic has me horribly confused
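One way to picture the question above is a toy model, not the osquery code itself. It assumes the watchdog samples the worker's CPU time once per 3s interval, counts consecutive over-limit samples, and kills the worker once that streak spans 12 seconds. The 12-second figure comes from the question and is not confirmed in this thread; `WatchdogModel`, `cpu_limit_s`, and `sample` are hypothetical names.

```python
from dataclasses import dataclass

INTERVAL_S = 3        # sampling interval mentioned in the thread
LATENCY_LIMIT_S = 12  # assumption taken from the question, unconfirmed


@dataclass
class WatchdogModel:
    """Toy model of a sustained-CPU watchdog (hypothetical, not osquery's code)."""
    cpu_limit_s: float  # CPU seconds allowed per interval
    sustained: int = 0  # consecutive over-limit samples seen so far

    def sample(self, cpu_used_s: float) -> bool:
        """Record one interval; return True if the worker would be killed."""
        if cpu_used_s > self.cpu_limit_s:
            self.sustained += 1
        else:
            # Any interval under the limit resets the streak.
            self.sustained = 0
        return self.sustained * INTERVAL_S >= LATENCY_LIMIT_S


# "About 65% of a core" over a 3s interval is roughly 1.95 CPU seconds.
model = WatchdogModel(cpu_limit_s=0.65 * INTERVAL_S)
kills = [model.sample(3.0) for _ in range(4)]  # worker pinned at 100% of a core
print(kills)  # [False, False, False, True]
```

Under this model, a worker that dips below the limit for even one interval between waves of events starts its streak over, so only sustained load triggers a kill.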
theopolis

02/03/2018, 4:11 AM
yes, CPU utilization is a confusing topic
4:11 AM
what are you trying to achieve?
4:11 AM
maybe we can alter the code to fit your use case
jaredl

02/03/2018, 5:27 PM
Currently, osqueryd is having to deal with thousands of connect and execve syscalls for the process/socket_event tag auditing. However, watchdog keeps killing the worker since it spikes to 100% CPU usage for a while (the events come in waves). So, I’m trying to give it some breathing room to actually get through the audit events.
5:28 PM
It’s not a great fix, since it still causes the CPU spikes, but it’s better than the worker constantly spinning up, taking 100%, dying, and that process continuing forever
5:29 PM
Maybe there’s a better way to pull off what I’m trying to accomplish. I tried setting the rate limit via auditctl -r but it doesn’t seem to actually have an effect from what I can see.
9:25 PM
The other idea I just had would be to use cgroups to restrict the CPU usage allowed by osquery. Although, I think the ‘root’ of the issue is its kernel audit rules attempting to get all of the connect/bind/execve syscall events.
theopolis

02/03/2018, 9:35 PM
@clong and @alessandrogario may have ideas for audit specifically
9:35 PM
but we at FB use cgroups for osquery
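The thread does not say how the cgroups are configured, so the following is only a sketch of what a CPU cap could look like. It assumes the cgroup v1 CFS bandwidth files `cpu.cfs_period_us` and `cpu.cfs_quota_us`; the helper names and the 65% figure are illustrative assumptions.

```python
from pathlib import Path


def cfs_settings(cpu_percent: float, period_us: int = 100_000) -> dict:
    """Map a CPU cap (percent of one core) to cgroup v1 CFS values.

    Hypothetical helper: a quota of 65_000us per 100_000us period lets the
    group run for at most 65% of one core's time in each period.
    """
    if cpu_percent <= 0:
        raise ValueError("cpu_percent must be positive")
    quota_us = int(period_us * cpu_percent / 100)
    return {"cpu.cfs_period_us": period_us, "cpu.cfs_quota_us": quota_us}


def apply_settings(group_dir: Path, settings: dict) -> None:
    """Write each value into the cgroup directory (needs root on a real host)."""
    for name, value in settings.items():
        (group_dir / name).write_text(str(value))


print(cfs_settings(65))
```

On a real host the group directory would sit under the mounted cpu controller, and the osquery processes would also have to be placed in that group; both steps are left out here because they depend on the system's setup.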
jaredl

02/03/2018, 9:51 PM
Yeah, I’m actually working with @clong to sort through some of the performance issues we’re seeing w/ audit & osquery.
9:52 PM
@theopolis - Do you disable watchdog and rely entirely on cgroups to restrict it, or do you use them both?
theopolis

02/04/2018, 12:08 AM
I use both
12:08 AM
If you’re testing audit, use the 3.0.0 flag
12:08 AM
Er, tag
jaredl

02/04/2018, 12:23 AM
This is actually our production environment w/ 2.10 although I’m looking forward to the audit rewrite work as well