Hello Fleet. We have recently been experiencing performance issues with our AWS ElastiCache Redis instance. We are running Fleet 3.13.3 with just shy of 40,000 nodes registered. Redis engine CPU usage had been high for a while, but then it started cycling: spiking to 100% for ~4 min, dropping to ~60% for ~3 min, then back up. This caused the Fleet UI to behave very erratically.
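For anyone debugging something similar, a quick way to confirm what's hammering Redis is to watch command throughput directly with redis-cli (a diagnostic sketch; the endpoint hostname below is a placeholder for your ElastiCache endpoint):

```
# Live ops/sec, connected clients, and memory, refreshed every second
redis-cli -h my-cluster.xxxxxx.cache.amazonaws.com --stat

# Per-command call counts and cumulative CPU time; look for commands
# with call counts growing in step with your node check-in rate
redis-cli -h my-cluster.xxxxxx.cache.amazonaws.com info commandstats
```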
We traced the issue to our distributed interval, which was set to 30 seconds. Once we increased it to 120, Redis engine CPU dropped to basically 0 (just under 1%). It seems clear that 40,000 nodes checking in every 30 seconds was just too much for Redis to handle, even though we had scaled up to a very large instance (cache.r6g.16xlarge) while troubleshooting.
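For reference, the knob in question is osquery's `distributed_interval` option. We manage it through Fleet's agent options, but for a standalone agent it maps to flags roughly like this (excerpt of a flagfile; the hostname is a placeholder):

```
# osquery flagfile (excerpt) - distributed query plumbing pointed at Fleet
--tls_hostname=fleet.example.com
--distributed_plugin=tls
--distributed_tls_read_endpoint=/api/v1/osquery/distributed/read
--distributed_tls_write_endpoint=/api/v1/osquery/distributed/write
# How often each agent polls Fleet for pending live queries, in seconds.
# Raising this from 30 to 120 is what dropped our Redis CPU to ~1%.
--distributed_interval=120
```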
Couple of questions: it seems clear that the check-ins osquery clients do when pinging Fleet to look for distributed queries to run definitely hit Redis. What exactly is the check-in doing, and has this been tuned in more recent versions of Fleet?
Anyone out there with sizable osquery/Fleet installations have thoughts on a best-practice value for the distributed interval? Do enterprise Fleet users generally avoid ad-hoc distributed queries and take the time to build scheduled packs instead, or are people just setting the interval in the couple-of-minutes range and accepting the delay in getting results from ad-hoc UI queries?
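For context on the packs alternative: in plain osquery config, a scheduled pack looks something like the sketch below (the table, interval, and pack name are illustrative, not a recommendation), so results are collected on the agents' own schedule rather than fanned out through Redis on demand:

```json
{
  "queries": {
    "listening_ports": {
      "query": "SELECT pid, port, address FROM listening_ports;",
      "interval": 3600,
      "description": "Snapshot of listening sockets, collected hourly"
    }
  }
}
```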
Thanks!