# kolide
d
Hello everyone. Has anyone tried to stress test Fleet? If so, where did you find that it broke down? When we were discussing the commercial offering with Kolide, we were told that it starts to have problems once it gets into the 5-digit range (10,000+ clients). If that's the case, I'm wondering what breaks - is that a single server that could just be scaled out, or is the limitation somewhere in the Redis or DB stack? We've been assuming that we'll need to shard our Fleet environments when we get to around 10k clients, but if we could just scale the UI horizontally instead, we'd of course rather do that. Eventually we'll get to 60,000+ clients and are planning to start some stress testing, but if others have good experience here, I'd love to learn from it.
z
That was the case up til a few months ago. It's now been stress tested up to 150k.
d
oh, that's awesome @zwass. Is that with a single UI server?
also, is that the version that just went GA - 3.2.0 (I think)?
z
3.0.0+
Let me take a look at my notes about the server setup. But it's MySQL where the scaling challenge is.
d
awesome, thanks
z
Fleet servers easily scale horizontally (and are pretty low resource anyway). MySQL is harder to scale, but I was using a "Large" AWS MySQL server.
MySQL: db.m4.4xlarge (core count 8, 16 vCPU, 64 GB memory)
Redis: cache.t2.micro (1 vCPU, 0.555 GB memory)
Fleet servers: 6 server instances running in containers on AWS Fargate
This scenario had 160,000 "online hosts" and 700,000 "enrolled hosts" (offline + online). Fleet servers and MySQL were sitting at ~30-50% CPU. Redis went up to about 25% CPU.
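For a rough sense of what those figures imply per instance, here's a back-of-envelope sketch. The even split of hosts across Fleet servers is my assumption for illustration; the test notes above don't say how load was actually distributed.

```python
# Back-of-envelope math for the stress-test figures above.
# Assumption: online hosts are spread evenly across the Fleet servers.
online_hosts = 160_000
enrolled_hosts = 700_000
fleet_servers = 6

hosts_per_server = online_hosts / fleet_servers
online_ratio = online_hosts / enrolled_hosts

print(f"~{hosts_per_server:,.0f} online hosts per Fleet server")
print(f"~{online_ratio:.0%} of enrolled hosts online")
```

So each Fleet server was handling on the order of 26-27k online hosts while sitting at moderate CPU, which is consistent with the claim that the Fleet tier scales out easily and MySQL is the harder part.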
d
nice
z
Keep in mind this is all simulated hosts via github.com/dactivllc/osquery-perf.
d
was the CPU usage on MySQL mostly writes?
z
I don't have notes on that, but I would think it is substantially writes.
d
ya, that's what I assumed
this is great, thanks so much for the info
z
Please take some notes if you make a large deploy. I'd like to compile this information to document it better for folks.
d
will do. 🙂
🍻 1
how was RAM usage during all of that?
z
I don't have any notes on that. RAM has typically not been an issue for Fleet scaling.
d
kk, thanks
can I ask one more thing @zwass: what data is Fleet writing to MySQL that would cause significant CPU usage? I assume the query results just get written to logs, so the DB side would just be things like clients and metadata about the clients
z
Every request from an osquery client generates at least one db read/write (authentication and marking online status). Then depending on the request the label membership may need to be updated, the host details updated, configs (packs/queries) generated, etc.
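That "at least one read/write per request" point can be turned into a rough load estimator. The check-in interval below is an illustrative assumption, not a Fleet default stated in this conversation:

```python
# Rough estimator of MySQL operations driven by osquery check-ins.
# The interval and ops-per-check-in values are illustrative assumptions.
def db_ops_per_second(host_count: int, checkin_interval_s: int,
                      ops_per_checkin: int = 1) -> float:
    """Each check-in triggers at least one DB operation (auth +
    online-status update); label/detail updates add more."""
    return host_count * ops_per_checkin / checkin_interval_s

# e.g. 160k online hosts each checking in every 10 seconds:
print(db_ops_per_second(160_000, 10))  # → 16000.0
```

Even at the floor of one operation per check-in, host volume alone generates a steady stream of writes, which matches the observation that MySQL CPU was substantially write-driven.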
d
ok, thanks
z
There's certainly more caching and optimization that can be done, but this does the job for now.
d
for sure.
when you did your testing did you have packs created and assigned to these hosts, or were you really just testing the initial registration and the authentication and status updates for periodic check-ins?
hey @zwass, sorry to bug you on this, but just wanted to check on that last question ^. We are very likely going to leverage your load-testing tool just to baseline how our specific environment (instance sizes, etc) performs, but I wanted to verify whether your testing included anything with packs / queries or if it was all just about getting massive amounts of nodes connected to Fleet. Thanks!
z
In my test the hosts do respond with some fake data to the detail queries, as well as a standard response to a live query. It did not actually test logging scheduled query results, though this is a pretty lightweight operation as the only database call is to authenticate the host.
d
ok, awesome. Thanks. We'll likely run some packs and maybe some ad-hoc searches as well
thanks again