#fleet

Francisco Huerta

03/05/2021, 9:17 AM
Hi, everyone. Hoping someone can provide some hints on a Fleet sizing problem we're seeing in our labs: we're running some stress tests with a single Fleet node, and at a certain point we start seeing "enrolling too often" errors that lead to Fleet becoming unstable. Assuming there has to be a breaking point somewhere, are there any techniques to prevent this problem? e.g., enabling multiple network interfaces (currently we only have one) for osquery <> Fleet traffic? Any config parameters to tweak?
9:18 AM
As mentioned, any guidance, similar experiences, best practices ... would be very useful at this stage. Thanks very much!

zwass

03/05/2021, 3:39 PM
Are you running multiple instances of osquery on the same host to do this load testing?

Francisco Huerta

03/05/2021, 5:29 PM
hey @zwass, our setup is as follows: 10 hosts running approx. 500 Docker containers each, for a total of 5,000 osquery instances. Those 5,000 endpoints are hitting a single Fleet DM server (an eight-core machine)
5:30 PM
we get two types of errors: "enrolling too often" and also "TLS handshake error: EOF".
5:31 PM
MySQL is running on a separate VM configured with a maximum of 400 simultaneous connections

zwass

03/05/2021, 6:03 PM
EOF could be due to running out of open sockets/file descriptors and might require adjusting ulimit on the server and/or the Docker hosts.
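A minimal sketch of the kind of check and adjustment meant here, assuming Linux hosts and a systemd-managed Fleet service (the process name, paths, and values are illustrative, not taken from this thread):

    # Check the open-file limit actually applied to the running Fleet process
    cat /proc/$(pgrep -o fleet)/limits | grep 'open files'

    # Raise the soft limit for the current shell (and anything launched from it)
    ulimit -n 65535

    # For a systemd-managed service, a persistent override could go in
    # /etc/systemd/system/fleet.service.d/override.conf:
    [Service]
    LimitNOFILE=65535

The same check applies on the Docker hosts running the osquery containers.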

Francisco Huerta

03/05/2021, 6:16 PM
that's something we suspected too, but we increased the ulimit parameter, and at peak moments we are not close to that limit

zwass

03/05/2021, 6:39 PM
What are you setting for --host_identifier on the Docker hosts?
6:40 PM
Depending on the deployment scenario, that is often a cause of "enrolling too often"
6:40 PM
Setting it to instance tends to help.

Francisco Huerta

03/05/2021, 6:44 PM
we are not setting it, so I guess it must be getting the default value.
6:46 PM
I cannot see instance in the documentation. Do you mean setting it as --host_identifier=instance?
6:46 PM
what would this be helpful for?
6:47 PM
(appreciate all prompt replies, by the way, thanks!) 👍

zwass

03/05/2021, 6:48 PM
If the containers are sharing hardware UUIDs, this helps Fleet see each container as a separate instance of osquery.

Francisco Huerta

03/05/2021, 6:51 PM
got you. we will give it a try. thanks so much!
7:09 PM
sorry @zwass, do you mean --host_identifier=uuid, or is it --host_identifier=instance? Just to confirm I'm doing it right.

zwass

03/05/2021, 7:46 PM
Try using instance.
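For reference, a minimal osquery flagfile showing the setting under discussion alongside typical TLS enrollment flags; the server address and secret path are placeholders, not values from this thread:

    # osquery.flags -- illustrative; adjust the hostname and secret path to your setup
    --host_identifier=instance
    --tls_hostname=fleet.example.com:8080
    --enroll_secret_path=/etc/osquery/enroll_secret
    --enroll_tls_endpoint=/api/v1/osquery/enroll
    --config_plugin=tls
    --config_tls_endpoint=/api/v1/osquery/config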

Francisco Huerta

03/05/2021, 8:00 PM
👍
2:46 PM
Hey, @zwass. As an update, we've been testing the performance with --host_identifier = instance and we don't see much of a difference. After a certain threshold, we again see EOF messages popping up.
2:47 PM
When this happens, we see an increase in the number of database connections (from a flat average of 50 when everything works fine to peaks of 400, our limit)
2:48 PM
CPU consumption also increases to 100%
2:49 PM
we've tried creating a second network interface to balance incoming connections from the agents to the Fleet manager, but we don't see a significant improvement here either
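If the 400-connection ceiling is what's being hit, the knobs involved are MySQL's own max_connections plus the size of the connection pool Fleet is allowed to open. A sketch with illustrative values; the Fleet-side keys are believed to be mysql.max_open_conns / mysql.max_idle_conns, so verify them against the configuration docs for your Fleet version:

    -- On the MySQL VM: raise the server-side ceiling (at runtime; mirror it in my.cnf)
    SET GLOBAL max_connections = 1000;

    # Fleet config (YAML): cap the pool Fleet opens against MySQL
    mysql:
      max_open_conns: 200
      max_idle_conns: 50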

zwass

03/09/2021, 3:57 PM
This is CPU consumption on the Fleet server or the MySQL server?
3:57 PM
FWIW I've load tested Fleet to 150,000+ simulated devices and folks are using Fleet in production on close to 100,000 devices.
3:58 PM
At around 5,000 devices you might want to think about adding a load balancer that routes traffic to multiple Fleet servers. But I know I can get more than that running on just my Mac laptop.
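As a sketch of that topology, an nginx front end could fan traffic out to multiple Fleet instances; all hostnames, ports, and certificate paths below are placeholders:

    # nginx.conf fragment -- placeholder names and paths throughout
    upstream fleet_backends {
        server fleet1.internal:8080;
        server fleet2.internal:8080;
    }

    server {
        listen 443 ssl;
        server_name fleet.example.com;
        ssl_certificate     /etc/nginx/certs/fleet.pem;
        ssl_certificate_key /etc/nginx/certs/fleet.key;

        location / {
            proxy_pass https://fleet_backends;
        }
    }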
3:59 PM
The TLS EOF errors are in the osquery logs or the Fleet server logs?
5:59 PM
As for the "enrolling too often" error, I found a bug in osquery that is probably causing this (https://github.com/osquery/osquery/issues/6993). Because of that we are disabling the enrollment cooldown by default in the next release, coming out today. Pulling that down once we release it could help address that issue.

Francisco Huerta

03/09/2021, 6:40 PM
Thanks! trying to answer your questions / comments in order:
6:41 PM
CPU consumption refers to the Fleet server
6:41 PM
Yes, we've got a load balancer in front of the server
6:41 PM
the TLS EOF errors are reported by the Fleet server
6:42 PM
thanks for the indication on the bug, we will look into it 👍
6:45 PM
An extra insight: we see improvements when setting tls_session_reuse = false
6:46 PM
it seems that in our case (and maybe due to the way we create simulated endpoints) a number of connections are kept open over time, eventually causing Fleet to become unstable
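For completeness, that finding corresponds to a single agent-side flag, which could be appended to the flagfile sketched earlier (osquery's default for it is true):

    # Disable TLS session reuse (default true); in this thread it reduced lingering connections
    --tls_session_reuse=false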