How does the community handle FleetDM outages? (osquery #fleet)


Juan Alvarez

05/30/2022, 10:23 AM

Hi all, how does the community handle FleetDM outages? Let's consider the following scenario: a deployment of 1500~2000 endpoints where we are capturing evented tables (e.g. syslog_events, windows_events). Endpoints have reasonable activity, let's say 20 events every 5 seconds. In our case, we send the data via tls from osquery to FleetDM. In FleetDM, we apply some transformations to the messages (since we can't get osquery data "as is" on the SIEM side) and forward the data to our SIEM.

For some reason (Fleet is down, connectivity issues...) osquery can't send data to FleetDM, so it starts buffering the data. After some hours, the FleetDM server is healthy or reachable again, and all the agents start sending their buffered data at once. In our case, a deployment of ~1500 agents with a reasonable amount of activity overwhelms the FleetDM server (CPU at 100%) when every endpoint starts sending its buffered data. Memory usage and file descriptors then pile up, and eventually the box becomes unusable.

I wonder how people handle this kind of scenario and whether there is a good way to solve the problem. We can always increase the Fleet hardware specs or add nodes, but then CPU usage during normal operation is too low and it seems like a waste of resources. I have tried reducing buffered_log_max to 1000 to see whether a lower number of buffered events helps, but I still see the same behavior. Any ideas or advice are appreciated. Thanks!
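For a rough sense of scale, the numbers in the message above can be turned into a back-of-the-envelope estimate. This is only a sketch in Python: the 3-hour outage is a hypothetical value (the message only says "some hours"), the names `Fleet` and `backlog_after_outage` are made up for illustration, and it assumes that one event corresponds to one buffered log line and that buffered_log_max acts as a per-agent cap on buffered lines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fleet:
    endpoints: int            # number of osquery agents
    events_per_interval: int  # events generated per endpoint per interval
    interval_s: float         # length of that interval, in seconds

    def events_per_second(self) -> float:
        """Steady-state event rate across the whole deployment."""
        return self.endpoints * self.events_per_interval / self.interval_s


def backlog_after_outage(fleet: Fleet, outage_s: float,
                         per_endpoint_cap: Optional[int] = None) -> int:
    """Total events buffered across all agents when the server comes back.

    per_endpoint_cap models a per-agent buffer limit (assumed here to be
    what buffered_log_max controls); None means no limit.
    """
    per_endpoint = fleet.events_per_interval / fleet.interval_s * outage_s
    if per_endpoint_cap is not None:
        per_endpoint = min(per_endpoint, per_endpoint_cap)
    return int(per_endpoint * fleet.endpoints)


# Figures from the message: 1500 endpoints, 20 events every 5 seconds.
fleet = Fleet(endpoints=1500, events_per_interval=20, interval_s=5)
outage_s = 3 * 60 * 60  # hypothetical 3-hour outage

steady_rate = fleet.events_per_second()
uncapped = backlog_after_outage(fleet, outage_s)
capped = backlog_after_outage(fleet, outage_s, per_endpoint_cap=1000)

print(f"steady state: {steady_rate:.0f} events/s")
print(f"backlog without cap: {uncapped:,} events")
print(f"backlog with cap of 1000 per agent: {capped:,} events")
```

Under these assumptions, even with the cap at 1000 each agent fills its buffer after about 250 seconds of outage, so every agent reconnects with a full buffer and the server still receives a burst far above its steady-state rate. That would be consistent with lowering buffered_log_max not changing the observed behavior.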

Keith Swagler

05/30/2022, 6:55 PM

I'm curious why you can't send data to your SIEM directly. What most people will likely recommend if you are worried about having idle resources is putting Fleet in Kubernetes so that you can have pods created and destroyed as needed.

Keith Swagler

05/30/2022, 7:00 PM

but honestly at that deployment size I think you probably could just add some more resources

Mystery Incorporated

05/31/2022, 8:00 AM

Some people put a few FleetDM instances behind a load balancer to spread out the load and gain fault tolerance, so a downed FleetDM is very unlikely in the first place. Most cloud providers nowadays have turnkey load-balancing solutions, or you could build your own with HAProxy.

Juan Alvarez

05/31/2022, 3:20 PM

Thanks for your answers, guys.
