# fleet
f
Hi, everyone. Hope someone can provide some hints on a Fleet sizing problem we're seeing in our labs: we're running some stress tests with a single Fleet node, and at a certain point we start seeing "enrolling too often" errors that lead to Fleet becoming unstable. Assuming there has to be a certain breaking point, are there any techniques to prevent this problem? E.g., enabling multiple network interfaces (currently we only have one) for osquery <> Fleet traffic? Any config parameters to tweak?
As said, any guidance, similar experiences, best practices ... would be very useful at this stage. Thanks much!
z
Are you running multiple instances of osquery on the same host to do this load testing?
f
hey, @zwass, our setup is as follows: 10x hosts running approx. 500 Docker containers each, for a total of 5,000 osquery instances. Those 5,000 endpoints are hitting a single Fleet DM server (eight-core machine)
we get two types of errors: "enrolling too often" and also "TLS handshake error: EOF".
MySQL is running on a separate VM configured with a max of 400 simultaneous connections
z
EOF could be due to running out of open sockets/file descriptors and might require adjusting ulimit on the server and/or Docker hosts.
f
that's something we suspected too, but we increased the ulimit parameter and at peak moments we are not close to that limit
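One way to confirm how close a process is to its descriptor limit is to compare its open file descriptors against the limit the kernel reports for it. The following is a minimal sketch, assuming a Linux host with `/proc` mounted; the helper name `fd_usage` is hypothetical, and for a real check you would pass the PID of the Fleet server or an osquery process instead of the current one.

```python
import os


def fd_usage(pid):
    """Return open-descriptor count and 'Max open files' limits for a PID.

    Linux-only sketch: reads /proc/<pid>/fd and /proc/<pid>/limits.
    Limits reported as 'unlimited' are returned as None.
    """
    # Listing the directory briefly opens one extra descriptor when pid is
    # the current process, so treat the count as approximate.
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))

    soft = hard = None
    with open(f"/proc/{pid}/limits") as limits:
        for line in limits:
            if line.startswith("Max open files"):
                # Columns: Max open files <soft> <hard> files
                soft_raw, hard_raw = line.split()[3:5]
                soft = None if soft_raw == "unlimited" else int(soft_raw)
                hard = None if hard_raw == "unlimited" else int(hard_raw)
                break

    return {"open": open_fds, "soft_limit": soft, "hard_limit": hard}


if __name__ == "__main__":
    print(fd_usage(os.getpid()))
```

Reading the limit from `/proc/<pid>/limits` rather than from the shell matters because a raised `ulimit` in an interactive session does not necessarily apply to a service that was started elsewhere.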
z
What are you setting for --host_identifier in the Docker hosts?
Depending on the deployment scenario, that is often a cause of "enrolling too often".
Setting it to instance tends to help.
f
we are not setting it, so I guess it must be getting the default value.
I cannot see instance in the documentation. Do you mean setting it as --host_identifier = instance?
How would this help?
(appreciate all prompt replies, by the way, thanks!) 👍
z
In case the containers are sharing hardware UUIDs this helps Fleet see each container as a separate instance of osquery.
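The effect described here can be illustrated with a toy model. This is only a sketch of the reasoning, not osquery's or Fleet's actual code: it assumes that in `uuid` mode every container reports its Docker host's hardware UUID, while in `instance` mode each osquery instance reports an identifier of its own. The function name `enrolled_identities` is hypothetical, and random UUIDs stand in for the real identifiers.

```python
import uuid


def enrolled_identities(hosts, containers_per_host, mode):
    """Return the set of distinct identifiers a server would see.

    mode="uuid":     containers share their host's hardware UUID (assumed).
    mode="instance": every osquery instance has its own identifier.
    """
    if mode not in ("uuid", "instance"):
        raise ValueError(f"unknown mode: {mode!r}")

    seen = set()
    for _ in range(hosts):
        hardware_uuid = str(uuid.uuid4())  # stand-in for the host's UUID
        for _ in range(containers_per_host):
            if mode == "uuid":
                seen.add(hardware_uuid)
            else:
                seen.add(str(uuid.uuid4()))  # stand-in for a per-instance ID
    return seen


if __name__ == "__main__":
    # Setup from this thread: 10 hosts x 500 containers.
    print(len(enrolled_identities(10, 500, "uuid")))      # 10
    print(len(enrolled_identities(10, 500, "instance")))  # 5000
```

Under that assumption, 5,000 agents collapse into 10 identities, so each identity appears to enroll again and again, which is the kind of pattern an "enrolling too often" check would flag.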
f
got you. we will give it a try. thanks so much!
sorry @zwass, do you mean --host_identifier = uuid, or is it --host_identifier = instance? just to confirm I'm doing it right
z
Try using instance.
f
👍
Hey, @zwass. As an update, we've been testing the performance with --host_identifier = instance and we don't see much of a difference. After a certain threshold, we again see EOF messages popping up.
When this happens, we see an increase in the number of database connections (from a flat average of 50 when everything works fine to peaks of 400, our limit)
CPU consumption also rises to 100%
we've tried creating a second network interface to balance incoming connections from the agents to the Fleet manager but we don't see a significant improvement here either
z
This is CPU consumption on the Fleet server or the MySQL server?
FWIW I've load tested Fleet to 150,000+ simulated devices and folks are using Fleet in production on close to 100,000 devices.
At around 5,000 devices you might want to think about adding a load balancer routing traffic to multiple Fleet servers. But I know I can get more than that running on just my mac laptop.
The TLS EOF errors are in the osquery logs or the Fleet server logs?
As for the "enrolling too often" error, I found a bug in osquery that is probably causing this (https://github.com/osquery/osquery/issues/6993). Because of that we are disabling the enrollment cooldown by default in the next release, coming out today. Pulling that down once we release it could help address that issue.
f
Thanks! trying to answer your questions / comments in order:
CPU consumption refers to the Fleet server
Yes, we've got a load balancer in front of the server
TLS EOF are reported by the Fleet server
thanks for the indication on the bug, we will look into it 👍
An extra insight: we see improvements when setting tls_session_reuse = false
it seems that in our case (and maybe due to the way we create simulated endpoints) a number of connections are kept open over time, eventually causing Fleet to become unstable
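For anyone reproducing this setup, the two settings tried in this thread can be written into a flagfile for each simulated agent. The sketch below assumes the usual one `--name=value` per line flagfile layout, with no spaces around the `=`; the helper `render_flagfile` is hypothetical, and a working agent needs further flags (TLS hostname, enroll secret, and so on) that are not shown in this thread.

```python
def render_flagfile(flags):
    """Render a dict of flag names and values as flagfile text.

    Booleans are written as lowercase true/false; each flag goes on its
    own line in --name=value form.
    """
    lines = []
    for name, value in flags.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"--{name}={value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Only the two settings discussed in this thread.
    flags = {
        "host_identifier": "instance",
        "tls_session_reuse": False,
    }
    print(render_flagfile(flags), end="")
```

Generating the file from one dict keeps all 5,000 simulated agents on identical settings, which makes it easier to attribute a change in behavior to a single flag.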