Juan Alvarez
05/30/2022, 10:23 AM
tls from osquery to FleetDM. In FleetDM, we apply some transformations to the messages (since we can't get osquery data "as is" on the SIEM side) and forward the data to our SIEM.
Sometimes (Fleet is down, connectivity issues...) osquery can't send data to FleetDM, so it starts buffering the data. After some hours, the FleetDM server is healthy or reachable again, and all the agents start sending their buffered data at once.
In our case, a deployment of ~1500 agents with a reasonable amount of activity overwhelms the FleetDM server (CPU at 100%) when every endpoint starts sending its buffered data. Memory usage and file descriptors then pile up, and eventually the box becomes unusable.
I wonder how people handle this kind of scenario and whether there is a good way to solve this problem. We can always increase the Fleet hardware specs or add nodes, but then CPU usage during normal operation is too low, which seems like a waste of resources.
I have tried reducing buffered_log_max to 1000 to see whether a smaller number of buffered events helps, but I still see the same behavior.
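A rough back-of-the-envelope sketch of why the lower cap may still not be enough, treating buffered_log_max as the per-agent limit on buffered entries. The function names and the ingest rate of 5000 entries per second are purely illustrative assumptions, not measured Fleet figures:

```python
def worst_case_backlog(agents: int, buffered_log_max: int) -> int:
    """Upper bound on entries waiting across the fleet after an outage,
    assuming every agent has filled its buffer to the cap."""
    return agents * buffered_log_max


def drain_seconds(backlog: int, ingest_per_second: float) -> float:
    """Time to work through the backlog at a steady ingest rate.
    The rate is a hypothetical input, not a known Fleet number."""
    if ingest_per_second <= 0:
        raise ValueError("ingest_per_second must be positive")
    return backlog / ingest_per_second


if __name__ == "__main__":
    # 1500 agents and a cap of 1000 come from the setup described above.
    backlog = worst_case_backlog(agents=1500, buffered_log_max=1000)
    print(backlog)
    # 5000 entries/s is an assumed rate, used only to show the arithmetic.
    print(drain_seconds(backlog, ingest_per_second=5000))
```

Even with the cap at 1000, the worst case is still on the order of a million entries arriving in a short window, on top of normal traffic.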
Any ideas or advice would be appreciated.
Thanks!
Keith Swagler
05/30/2022, 6:55 PM
Mystery Incorporated
05/31/2022, 8:00 AM
Juan Alvarez
05/31/2022, 3:20 PM