#fleet

Francisco Huerta

03/05/2021, 9:17 AM
Hi, everyone. Hoping someone can provide some hints on a Fleet sizing problem we're seeing in our labs: we're running some stress tests with a single Fleet node, and at a certain point we start seeing "enrolling too often" errors that lead to Fleet becoming unstable. Assuming there has to be a breaking point somewhere, are there any techniques to prevent this problem? e.g., enabling multiple network interfaces (currently we only have one) for osquery <> Fleet traffic? Any config parameters to tweak?
9:18 AM
As mentioned, any guidance, similar experiences, best practices ... would be very useful at this stage. Thanks very much!

zwass

03/05/2021, 3:39 PM
Are you running multiple instances of osquery on the same host to do this load testing?

Francisco Huerta

03/05/2021, 5:29 PM
hey @zwass, our setup is as follows: 10 hosts running approx. 500 Docker containers each, for a total of 5,000 osquery instances. Those 5,000 endpoints are hitting a single Fleet DM server (an eight-core machine)
5:30 PM
we get two types of errors: "enrolling too often" and also "TLS handshake error: EOF".
5:31 PM
MySQL is running on a separate VM configured with a maximum of 400 simultaneous connections

zwass

03/05/2021, 6:03 PM
EOF could be due to running out of open sockets/file descriptors and might require adjusting ulimit on the server and/or the Docker hosts.
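A minimal sketch of the kind of check and adjustment meant here, assuming Linux hosts and a systemd-managed Fleet service (the process name, paths, and values are illustrative, not taken from this thread):

    # Check the open-file limit actually applied to the running Fleet process
    cat /proc/$(pgrep -o fleet)/limits | grep 'open files'

    # Raise the soft limit for the current shell (and anything launched from it)
    ulimit -n 65535

    # For a systemd-managed service, a persistent override could go in
    # /etc/systemd/system/fleet.service.d/override.conf:
    [Service]
    LimitNOFILE=65535

The same check applies on the Docker hosts running the osquery containers.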

Francisco Huerta

03/05/2021, 6:16 PM
that's something we suspected too, but we increased the ulimit parameter, and at peak moments we are not close to that limit

zwass

03/05/2021, 6:39 PM
What are you setting for --host_identifier on the Docker hosts?
6:40 PM
Depending on the deployment scenario, that is often a cause of "enrolling too often"
6:40 PM
Setting it to instance tends to help.

Francisco Huerta

03/05/2021, 6:44 PM
we are not setting it, so I guess it must be getting the default value.
6:46 PM
I cannot see instance in the documentation. Do you mean setting it as --host_identifier=instance?
6:46 PM
what would this be helpful for?
6:47 PM
(appreciate all prompt replies, by the way, thanks!) 👍

zwass

03/05/2021, 6:48 PM
If the containers are sharing hardware UUIDs, this helps Fleet see each container as a separate instance of osquery.

Francisco Huerta

03/05/2021, 6:51 PM
got you. we will give it a try. thanks so much!
7:09 PM
sorry @zwass, do you mean --host_identifier=uuid, or is it --host_identifier=instance? Just to confirm I'm doing it right.

zwass

03/05/2021, 7:46 PM
Try using instance.
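For reference, a minimal osquery flagfile showing the setting under discussion alongside typical TLS enrollment flags; the server address and secret path are placeholders, not values from this thread:

    # osquery.flags -- illustrative; adjust the hostname and secret path to your setup
    --host_identifier=instance
    --tls_hostname=fleet.example.com:8080
    --enroll_secret_path=/etc/osquery/enroll_secret
    --enroll_tls_endpoint=/api/v1/osquery/enroll
    --config_plugin=tls
    --config_tls_endpoint=/api/v1/osquery/config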

Francisco Huerta

03/05/2021, 8:00 PM
👍
2:46 PM
Hey, @zwass. As an update, we've been testing the performance with --host_identifier = instance and we don't see much of a difference. After a certain threshold, we again see EOF messages popping up.
2:47 PM
When this happens, we see an increase in the number of database connections (from a flat average of 50 when everything works fine to peaks of 400, our limit)
2:48 PM
CPU consumption also increases to 100%
2:49 PM
we've tried creating a second network interface to balance incoming connections from the agents to the Fleet manager, but we don't see a significant improvement here either
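If the 400-connection ceiling is what's being hit, the knobs involved are MySQL's own max_connections plus the size of the connection pool Fleet is allowed to open. A sketch with illustrative values; the Fleet-side keys are believed to be mysql.max_open_conns / mysql.max_idle_conns, so verify them against the configuration docs for your Fleet version:

    -- On the MySQL VM: raise the server-side ceiling (at runtime; mirror it in my.cnf)
    SET GLOBAL max_connections = 1000;

    # Fleet config (YAML): cap the pool Fleet opens against MySQL
    mysql:
      max_open_conns: 200
      max_idle_conns: 50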

zwass

03/09/2021, 3:57 PM
This is CPU consumption on the Fleet server or the MySQL server?
3:57 PM
FWIW I've load tested Fleet to 150,000+ simulated devices and folks are using Fleet in production on close to 100,000 devices.
3:58 PM
At around 5,000 devices you might want to think about adding a load balancer that routes traffic to multiple Fleet servers. But I know I can get more than that running on just my Mac laptop.
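As a sketch of that topology, an nginx front end could fan traffic out to multiple Fleet instances; all hostnames, ports, and certificate paths below are placeholders:

    # nginx.conf fragment -- placeholder names and paths throughout
    upstream fleet_backends {
        server fleet1.internal:8080;
        server fleet2.internal:8080;
    }

    server {
        listen 443 ssl;
        server_name fleet.example.com;
        ssl_certificate     /etc/nginx/certs/fleet.pem;
        ssl_certificate_key /etc/nginx/certs/fleet.key;

        location / {
            proxy_pass https://fleet_backends;
        }
    }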
3:59 PM
The TLS EOF errors are in the osquery logs or the Fleet server logs?
5:59 PM
As for the "enrolling too often" error, I found a bug in osquery that is probably causing this (https://github.com/osquery/osquery/issues/6993). Because of that we are disabling the enrollment cooldown by default in the next release, coming out today. Pulling that down once we release it could help address that issue.

Francisco Huerta

03/09/2021, 6:40 PM
Thanks! trying to answer your questions / comments in order:
6:41 PM
CPU consumption refers to the Fleet server
6:41 PM
Yes, we've got a load balancer in front of the server
6:41 PM
the TLS EOF errors are reported by the Fleet server
6:42 PM
thanks for the indication on the bug, we will look into it 👍
6:45 PM
An extra insight: we see improvements when setting tls_session_reuse = false
6:46 PM
it seems that in our case (and maybe due to the way we create simulated endpoints) a number of connections are kept open over time, eventually causing Fleet to become unstable
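For completeness, that finding corresponds to a single agent-side flag, which could be appended to the flagfile sketched earlier (osquery's default for it is true):

    # Disable TLS session reuse (default true); in this thread it reduced lingering connections
    --tls_session_reuse=false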