# fleet
j
Hi all, how does the community handle FleetDM outages? Let's consider the following scenario: a deployment of 1500~2000 endpoints where we capture evented tables (e.g. syslog_events, windows_events). Endpoints have reasonable activity, let's say 20 events every 5 seconds. In our case, we send the data via tls from osquery to FleetDM. In FleetDM, we apply some transformations to the messages (since we can't consume osquery data "as is" on the SIEM side) and forward the data to our SIEM.
For some reason (Fleet is down, connectivity issues...) osquery can't send data to FleetDM, so it starts buffering the data. After some hours, the FleetDM server is healthy or reachable again, and all the agents start sending their buffered data at once. In our case, a deployment of ~1500 agents with a reasonable amount of activity overwhelms the FleetDM server (CPU at 100%) when every endpoint starts flushing its buffer. Memory usage and open file descriptors then pile up, and eventually the box becomes unusable.
I wonder how people handle this kind of scenario and whether there is a good way to solve the problem. We can always increase the Fleet hardware specs or add nodes, but then CPU usage during normal operation is very low and it seems like a waste of resources. I have tried reducing buffered_log_max to 1000 to see if a smaller number of buffered events helps, but I still see the same behavior. Any ideas and/or advice are appreciated. Thanks!
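For a rough sense of scale, here is a back-of-the-envelope sketch of the backlog in the scenario above. It rests on assumptions that are not in the original message: the outage lasts 3 hours ("some hours" is not specified), each event becomes one buffered entry, and the buffer cap is a plain per-host count. The function name `backlog_estimate` is hypothetical.

```python
def backlog_estimate(hosts, events_per_interval, interval_s, outage_hours, buffer_cap=None):
    """Estimate buffered entries per host and fleet-wide after an outage.

    Assumes one buffered entry per event and a per-host cap expressed
    as a simple entry count (both are simplifications).
    """
    rate_per_host = events_per_interval / interval_s      # events per second per host
    per_host = rate_per_host * outage_hours * 3600        # entries accumulated per host
    if buffer_cap is not None:
        per_host = min(per_host, buffer_cap)              # older entries are dropped at the cap
    return {
        "fleet_rate_per_s": rate_per_host * hosts,        # steady-state ingest rate
        "per_host": per_host,
        "fleet_total": per_host * hosts,                  # entries replayed on reconnect
    }


# 1500 hosts, 20 events every 5 seconds, assumed 3-hour outage.
uncapped = backlog_estimate(1500, 20, 5, 3)
capped = backlog_estimate(1500, 20, 5, 3, buffer_cap=1000)

print(uncapped)
print(capped)
```

Under these assumptions the capped backlog is about 1.5 million entries fleet-wide, roughly 250 seconds of steady-state traffic at 6000 events per second. That may suggest the spike comes from all agents reconnecting at the same moment rather than from the total volume alone, but this is an estimate, not a measurement.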
k
I'm curious why you can't send data to your SIEM directly. What most people will likely recommend if you are worried about having idle resources is putting Fleet in Kubernetes so that you can have pods created and destroyed as needed.
But honestly, at that deployment size I think you could probably just add some more resources.
m
Some people put a few FleetDM instances behind a load balancer to spread out the load and gain fault tolerance, so a downed FleetDM is very unlikely in the first place. Most cloud providers nowadays offer turnkey load-balancing solutions, or you could build your own with HAProxy.
j
Thanks for your answers, guys.