#kolide
Dan Achin

10/27/2020, 7:00 PM
Hello everyone. Has anyone tried to stress test Fleet? If so, where did you find that it broke down? When we discussed this with Kolide while considering the commercial offering, we were told that it starts to have problems once it reaches 5 digits (i.e. 10,000+) of clients. If that's the case, I'm wondering what breaks: is it a single server that could just be scaled out, or is the limitation somewhere in the Redis or DB stack? We've been assuming that we'll need to shard our Fleet environments when we get to around 10k clients, but if we could just scale the UI horizontally instead, we'd of course rather do that. Eventually we'll get to 60,000+ clients and are planning to start some stress testing, but if others have good experience here, I'd love to learn from it.
zwass

10/27/2020, 7:13 PM
That was the case up until a few months ago. It's now been stress tested up to 150k.
Dan Achin

10/27/2020, 8:00 PM
oh, that's awesome @zwass. Is that with a single UI server?
8:39 PM
also, is that the version that just went GA - 3.2.0 (I think)?
zwass

10/27/2020, 8:39 PM
3.0.0+
8:40 PM
Let me take a look at my notes about the server setup. But it's MySQL where the scaling challenge is.
Dan Achin

10/27/2020, 8:40 PM
awesome, thanks
zwass

10/27/2020, 8:41 PM
Fleet servers easily scale horizontally (and are pretty low resource anyway). MySQL is harder to scale, but I was using a "Large" AWS MySQL server.
8:46 PM
MySQL: db.m4.4xlarge (Core count - 8, vCPU - 16, Memory - 64GB)
Redis: cache.t2.micro (vCPU - 1, Memory - 0.555GB)
Fleet servers: 6 server instances running in containers on AWS Fargate
8:47 PM
This scenario had 160,000 "online hosts" and 700,000 "enrolled hosts" (offline + online). Fleet servers and MySQL were sitting at ~30-50% CPU. Redis went up to about 25% CPU.
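For anyone sizing their own deployment from these numbers, here is a rough back-of-the-envelope sketch in Python. It only restates the figures reported above (160,000 online hosts across 6 Fleet instances) and assumes, purely for illustration, that the Fleet server tier scales linearly with online hosts. The function names are made up for this sketch, and it says nothing about MySQL, which is described in this thread as the harder component to scale.

```python
import math

# Figures reported in the thread for the simulated test.
ONLINE_HOSTS = 160_000
FLEET_INSTANCES = 6


def hosts_per_instance(online_hosts: int, instances: int) -> float:
    """Average number of online hosts served by each Fleet instance."""
    if instances <= 0:
        raise ValueError("instances must be positive")
    return online_hosts / instances


def fleet_instances_needed(target_online_hosts: int, reference_ratio: float) -> int:
    """Estimate instance count for a target host count.

    ASSUMPTION: load grows linearly with online hosts, and the reference
    ratio (hosts per instance) from the reported test carries over to a
    different environment. Treat the result as a starting point only.
    """
    return max(1, math.ceil(target_online_hosts / reference_ratio))


ratio = hosts_per_instance(ONLINE_HOSTS, FLEET_INSTANCES)
print(f"Reported test: about {ratio:.0f} online hosts per Fleet instance")
print(f"Estimate for 60,000 hosts: {fleet_instances_needed(60_000, ratio)} instances")
```

Since the Fleet servers in the reported test were sitting at roughly 30-50% CPU, an estimate like this is likely conservative for the server tier, but instance sizes differ between environments, so it is no substitute for running a load test.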
Dan Achin

10/27/2020, 8:48 PM
nice
zwass

10/27/2020, 8:48 PM
Keep in mind this is all simulated hosts via github.com/dactivllc/osquery-perf.
Dan Achin

10/27/2020, 8:48 PM
was the CPU usage on MySQL mostly writes?
zwass

10/27/2020, 8:49 PM
I don't have notes on that, but I would think it is substantially writes.
Dan Achin

10/27/2020, 8:49 PM
ya, that's what I assumed
8:49 PM
this is great, thanks so much for the info
zwass

10/27/2020, 8:50 PM
Please take some notes if you make a large deploy. I'd like to compile this information to document it better for folks.
Dan Achin

10/27/2020, 8:51 PM
will do. 🙂
8:56 PM
how was RAM usage during all of that?
zwass

10/27/2020, 8:58 PM
I don't have any notes on that. RAM has typically not been an issue for Fleet scaling.
Dan Achin

10/27/2020, 8:59 PM
kk, thanks
9:12 PM
can I ask one more thing @zwass: what data is Fleet writing to MySQL that would cause significant CPU usage? I assume the query results just get written to logs, so the DB side would just be things like clients and metadata about the clients
zwass

10/27/2020, 9:14 PM
Every request from an osquery client generates at least one db read/write (authentication and marking online status). Then depending on the request the label membership may need to be updated, the host details updated, configs (packs/queries) generated, etc.
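To make the "at least one db read/write per request" point concrete, here is a minimal Python sketch that turns check-in intervals into a lower bound on database operations per second. The interval values and all names below are hypothetical placeholders, not settings taken from this thread or from any particular deployment. The sketch counts only the one guaranteed operation per request and ignores the extra work (label membership, host detail updates, config generation) mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckinIntervals:
    """Seconds between requests of each type from one osquery client.

    ASSUMPTION: these defaults are illustrative only; substitute the
    intervals actually configured for your clients.
    """
    config_s: float = 60.0
    distributed_s: float = 10.0
    logger_s: float = 10.0


def min_db_ops_per_second(online_hosts: int, intervals: CheckinIntervals) -> float:
    """Lower bound on DB operations per second.

    Counts exactly one DB operation per client request (authentication
    and marking online status). Real load is higher whenever a request
    also updates labels, host details, or generates configs.
    """
    requests_per_host_per_second = (
        1.0 / intervals.config_s
        + 1.0 / intervals.distributed_s
        + 1.0 / intervals.logger_s
    )
    return online_hosts * requests_per_host_per_second


if __name__ == "__main__":
    ops = min_db_ops_per_second(160_000, CheckinIntervals())
    print(f"At least {ops:,.0f} DB operations per second under these assumed intervals")
```

The takeaway is that the floor on database load is set by host count and check-in frequency, so lengthening intervals is one lever to try alongside a larger database instance.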
Dan Achin

10/27/2020, 9:15 PM
ok, thanks
zwass

10/27/2020, 9:15 PM
There's certainly more caching and optimization that can be done, but this does the job for now.
Dan Achin

10/27/2020, 10:59 PM
for sure.
11:00 PM
when you did your testing did you have packs created and assigned to these hosts, or were you really just testing the initial registration and the authentication and status updates for periodic check-ins?
5:39 PM
hey @zwass, sorry to bug you on this, but just wanted to check on that last question ^. We are very likely going to leverage your load-testing tool just to baseline how our specific environment (instance sizes, etc.) performs, but I wanted to verify whether your testing included anything with packs / queries or if it was all just about getting massive numbers of nodes connected to Fleet. Thanks!
zwass

10/28/2020, 5:49 PM
In my test the hosts do respond with some fake data to the detail queries, as well as a standard response to a live query. It did not actually test logging scheduled query results, though this is a pretty lightweight operation as the only database call is to authenticate the host.
Dan Achin

10/28/2020, 6:26 PM
ok, awesome. Thanks. We'll likely run some packs and maybe some ad-hoc searches as well
6:26 PM
thanks again