Hi everyone. Hope you guys can give a hand here because I'm running out of ideas: in a production environment (Windows hosts) we are experiencing a severe loss of Windows Events on a number of servers under heavy load (approx. 100 new events/second).
No matter which tweaks we apply (e.g., reducing intervals), we end up in the same situation: after some time, the agent starts introducing delays between the time the event is recorded in Windows and the TLS logging time.
The best outcome we've gotten is fairly stable event collection for a couple of hours, but then we suddenly see the first delays / losses occurring.
Any guidance on how to configure osquery to prevent this would be highly appreciated. Thanks much in advance.
Thanks, @theopolis. Although the initial conditions are similar to the ones described there, the problem we are experiencing now is somewhat different and apparently unrelated: a number of Windows Events never make it to the central collection server, and we don't know whether this is a TLS-related problem or whether osquery is unable to cope with certain workloads.
One question: is there a way to tell whether the osquery agent is dropping query results? Our understanding is that if a query is not killed by the watchdog, its results should be processed entirely and eventually sent to the external logging system, but that is not happening.
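In case it helps anyone reproduce the measurement, this is roughly how the delay and loss could be quantified on the collector side. It is only a minimal sketch, assuming each received record can be reduced to a pair of Unix timestamps (when Windows recorded the event, and when the collection server received it). The function name `lag_report` and the one-hour window are illustrative choices, not anything provided by osquery.

```python
from statistics import median


def lag_report(records, window=3600):
    """Summarize delivery lag per time window.

    records: iterable of (event_time, received_time) pairs in Unix seconds.
    window:  bucket size in seconds, keyed by event_time.

    Returns a list of (window_start, event_count, median_lag_seconds),
    ordered by window_start.
    """
    buckets = {}
    for event_time, received_time in records:
        # Bucket by when the event was recorded, not when it arrived.
        lag = received_time - event_time
        buckets.setdefault(event_time // window, []).append(lag)
    return [
        (bucket * window, len(lags), median(lags))
        for bucket, lags in sorted(buckets.items())
    ]


if __name__ == "__main__":
    sample = [(0, 5), (10, 15), (3600, 3700), (3610, 3810)]
    for start, count, lag in lag_report(sample):
        print(start, count, lag)
```

With a steady load, a median lag that keeps growing from one window to the next would point to delay, while an event count per window that falls well below the expected rate would point to loss.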