Hello everyone. As we start planning our rollout, we are looking at how best to limit the impact of osquery to our production systems. I've watched this excellent presentation by @zwass, and we plan to leverage pretty much everything he discussed. However, I'm assuming that the data gathered from osquery_schedule is for scheduled queries and not ad-hoc queries. We are planning to schedule as much as possible, but at some point our security team will need to run queries as part of an investigation on a tight timeline and might revert to running something ad-hoc to get results more quickly. Assuming I'm correct in that osquery_schedule won't capture information about these queries, what are others doing to track their performance? Also, I'd love to hear if others have success stories or lessons learned around the items discussed in Zach's talk. Thanks!
10/19/2020, 8:54 PM
What OSes are you deploying to? If they are only Linux systems then I recommend limiting IO using systemd’s built in cgroup controls.
10/19/2020, 8:57 PM
@theopolis, thanks, that's a good call out. We are deploying to a mix of linux and windows, but primarily linux. Did you find that the watchdog was too permissive and large CPU spikes under 5 seconds in duration were causing issues?
10/19/2020, 9:10 PM
Yes and no, I have a lot of thoughts on the subject but not a lot of time to type right now. I’ll get back to you later on tonight.
10/19/2020, 9:19 PM
10/20/2020, 1:13 AM
I like the systemd/cgroups approach for a few reasons. First you can place all of your background (or OS-related) services into a group and say "I don't want this group to take more than N% CPU" as these are usually the lowest priority compared to whatever the machines are serving. You can place osquery into this group.
Second, you can be more fine-grained compared to the watchdog: for example you can set CPU/IO/Mem limits, whereas osquery only sets its ionice to the lowest priority but does not enforce any limits.
The downside is that you may control osquery enough that the watchdog never kills badly performing queries (good thing!?) and thus you won't know if you have a particularly bad query in the schedule -- however you can mitigate this by reviewing the statistics every once in a while.
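For anyone looking for a starting point, this is roughly what that setup could look like with a systemd slice (the slice name and limit values are just illustrative; `IOWeight=` needs the cgroup v2 unified hierarchy):

```ini
# /etc/systemd/system/background.slice -- hypothetical slice for
# low-priority background services; tune the numbers for your fleet
[Unit]
Description=Low-priority background services

[Slice]
CPUQuota=10%
MemoryMax=512M
IOWeight=10

# /etc/systemd/system/osqueryd.service.d/override.conf -- drop-in that
# moves osqueryd into the slice above
[Service]
Slice=background.slice
```

Then `systemctl daemon-reload && systemctl restart osqueryd` to apply.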
10/20/2020, 4:29 PM
@theopolis, thanks for that feedback, good stuff. Regarding the osquery_schedule info, as you say, if we control performance enough with cgroups, we may never trip the blacklist. It seems the only mitigation then would be reviewing the breakdown of query execution times, and wall times could actually be inflated if the cgroup limits are restrictive.
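One way we're thinking of making that review routine is to schedule the stats query itself as a snapshot, something like this config fragment (the query name and interval are illustrative; on older osquery versions the `denylisted` column is `blacklisted`):

```json
{
  "schedule": {
    "schedule_stats": {
      "query": "SELECT name, executions, user_time, system_time, average_memory, denylisted FROM osquery_schedule;",
      "interval": 3600,
      "snapshot": true
    }
  }
}
```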
10/20/2020, 5:21 PM
The user and system time would remain constant as a representation of the CPU cost for each query. I think some basic outlier checks would be effective.
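A rough sketch of what such an outlier check could look like, once the per-query stats are pulled out of osquery_schedule (the query names and CPU numbers below are made up for illustration):

```python
import statistics

# Hypothetical snapshot of per-query CPU cost (user_time + system_time, ms)
# as it might be collected from the osquery_schedule table.
query_cpu_ms = {
    "processes_snapshot": 120,
    "listening_ports": 95,
    "kernel_modules": 110,
    "file_events": 2400,  # suspiciously expensive
    "users_snapshot": 88,
}

def cpu_outliers(stats, z_threshold=1.5):
    """Flag queries whose CPU time sits more than z_threshold population
    standard deviations above the mean of the schedule."""
    values = list(stats.values())
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [name for name, ms in stats.items()
            if (ms - mean) / stdev > z_threshold]

print(cpu_outliers(query_cpu_ms))  # -> ['file_events']
```

With only a handful of queries a simple z-score like this is crude, but it's cheap to run against each stats snapshot and catches the "one query dominating the schedule" case.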