# core
Hrm. I wonder what correct is. I understand that in this case moving to SIGKILL makes more sense. But what is the flip side? Does that make it more likely to corrupt the local DB?
It might, although I would expect that in many cases where the limit is hit, osquery would be unresponsive to a nice stop anyway. If we had a bit less code, or had ingrained from the start a collaborative approach between the table logic and the watchdog, then maybe having some kind of delay and expecting the worker to exit cleanly would be sort of OK. Another way would be to add two limits: when the usual limit is hit, we give a graceful kill, but if in the next 2 seconds or so the CPU consumption continues or the memory keeps increasing, then we give an immediate kill. This is a tad less immediate to implement though.
What I’m mostly worried about is that, if you really wanted to, you could allocate something like 2GB per second (or more; this is just empirically tested via malloc + memset), so given that the watchdog already has a potential delay of 3 seconds, there’s already a lot of room to allocate a huge amount of memory. It’s true that all that memory is unlikely to come from thin air, and it probably comes from reads on the disk, but today we have NVMe drives with read speeds of GBs per second, so..
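To put rough numbers on that, here is a back-of-the-envelope sketch. It only multiplies the figures quoted above (2GB per second, a 3 second watchdog delay) and, as an extra assumption, adds a 2 second grace period for comparison. The function name and the decimal definition of GB are hypothetical choices for illustration, not anything taken from osquery.

```python
GB = 10**9  # decimal gigabyte, assumed here for simplicity


def worst_case_overshoot(alloc_rate_bytes_per_s, watchdog_delay_s, grace_period_s=0.0):
    """Upper bound on bytes allocated between crossing the limit and the kill.

    Assumes the process keeps allocating at a constant rate the whole time,
    which is a pessimistic simplification.
    """
    return int(alloc_rate_bytes_per_s * (watchdog_delay_s + grace_period_s))


# Figures from the discussion: 2GB/s allocation, 3 second watchdog delay.
immediate = worst_case_overshoot(2 * GB, 3)
# Same, plus a hypothetical 2 second graceful-shutdown window.
with_grace = worst_case_overshoot(2 * GB, 3, grace_period_s=2)

print(immediate / GB, with_grace / GB)
```

Under those assumptions the overshoot is about 6GB with an immediate kill and about 10GB when a 2 second grace period is added on top.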
Yes, agreed. In some ways, a 2 limit thing would make sense.
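For what it’s worth, the two-limit idea above could look roughly like the sketch below. This is not osquery’s watchdog code; `Action`, `WatchdogState` and `decide` are hypothetical names, memory is the only resource tracked, and forcing the kill once the grace window expires is my own assumption about what “then we give an immediate kill” should mean if the worker is simply still alive.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    NONE = "none"          # nothing to do on this check
    GRACEFUL = "graceful"  # e.g. SIGTERM on POSIX
    FORCE = "force"        # e.g. SIGKILL on POSIX


@dataclass
class WatchdogState:
    graceful_sent_at: Optional[float] = None  # time the graceful stop was sent
    memory_at_graceful: int = 0               # memory observed at that moment


def decide(state, now, memory_bytes, limit_bytes, grace_seconds=2.0):
    """Decide what to do with the worker on one watchdog check."""
    if state.graceful_sent_at is None:
        if memory_bytes > limit_bytes:
            # First time over the limit: ask nicely and remember when.
            state.graceful_sent_at = now
            state.memory_at_graceful = memory_bytes
            return Action.GRACEFUL
        return Action.NONE
    # A graceful stop is already pending.
    still_growing = memory_bytes > state.memory_at_graceful
    grace_expired = now - state.graceful_sent_at >= grace_seconds
    if still_growing or grace_expired:
        return Action.FORCE
    return Action.NONE


state = WatchdogState()
print(decide(state, now=0.0, memory_bytes=100, limit_bytes=200))
print(decide(state, now=3.0, memory_bytes=250, limit_bytes=200))
print(decide(state, now=4.5, memory_bytes=300, limit_bytes=200))
```

The caller is expected to invoke `decide` on every watchdog tick while the worker is alive and to reset the state once the worker has been respawned.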
I think I agree with your analysis, but come to a different conclusion. Or at least, a question.
On one hand, we could have a large grace period. This would give osquery the highest chance to save the DB and recover. It’s probably the right thing for some kinds of resource limits. On the other hand, we have an out-of-control process that needs rapid reaping before it takes down the machine. So I kinda want to ask which of the two is more common, and which of them we want to support. My gut sense is that we should stick with the gentle way. There are OS hooks for less gentle limits, and this provides some rope.
But maybe I’m misunderstanding some part of this PR.
I would say that it should be the opposite. The machine osquery runs on is more important than osquery, because osquery is not the reason for that machine to exist; there are other services.
And I’m not sure the OS limits can be set on all platforms. Linux has cgroups, which can hard limit CPU and memory, true; I’m not sure what you can do on macOS or Windows though. Also Windows, as mentioned in the PR, never had a graceful shutdown: it terminates the process immediately because it doesn’t have a signal mechanism.
On macOS there’s both ulimit, and something inside launchd.
I’m not sure if the machine or osquery is more important. That’s going to be very site dependent. And I think many sites value the integrity of their unsent logs. (ToB has done a lot to improve that)
I think this is probably the only thing before 5.2.2
I can only speak for us, but if osquery can affect running services, then we can’t run osquery. It’s as simple as that. The server being monitored is definitely more important than osquery itself (for us). We will run osquery on thousands of production servers, and if we send a query to all of them that risks taking down even a fraction of them, we just can’t run that query. Ideally, it should run the query where it can, and fail on the others.
It’s problematic that we can’t predict what behavior will take down servers, so I have a feeling we’d err on the side of caution and just remove osquery altogether. I am sure other users of osquery would be in the same situation. Operations teams won’t be happy if the security team mandates software that ends up affecting production environments and end users. Just a note on the “is osquery’s integrity more important than other services” discussion.
Having said that, I do understand the reasoning behind the idea that osquery’s integrity is the most important, as the monitoring should ideally never go down. It seems to me this is a big decision that will likely affect a lot of users both ways.
Looping back to this, rather late: we talked about it a bit at office hours. I suspect we’re going to end up adding some configuration surface to it. Ultimately there’s no “right” answer. Some sites will want to tune this to aggressively kill osquery, some will want to try to keep state.
Shameless plug, I’ve updated the PR with the separate delay