Hello Fleet. We have recently been experiencing some performance issues with our AWS ElastiCache Redis instance. We are running Fleet 3.13.3 with just shy of 40,000 nodes registered. Redis Engine CPU usage had been high for a while, but then it started to spike to 100% for ~4 min, drop to 60% for ~3 min, then climb back up. This caused the Fleet UI to behave very erratically.
We traced the issue to our distributed interval, which was set to 30 sec. Once we increased it to 120 sec, Redis Engine CPU dropped to basically 0 (just under 1%). It seems clear that 40,000 nodes checking in every 30 seconds was just too much for Redis to handle, even though we had scaled up to a very large instance while troubleshooting: cache.r6g.16xlarge.
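For a rough sense of scale, the check-in load implied by those numbers can be estimated with some quick arithmetic. This is only a back-of-the-envelope sketch: it assumes check-ins are spread evenly across the interval, and it counts check-ins rather than Redis commands (how many Redis operations each check-in triggers is not something established here). The function name `checkins_per_second` is made up for illustration.

```python
def checkins_per_second(hosts: int, interval_seconds: float) -> float:
    """Average check-ins per second, assuming hosts are evenly spread over the interval."""
    return hosts / interval_seconds


HOSTS = 40_000  # "just shy of 40,000 nodes", rounded up

# Compare the intervals mentioned in this thread: 10s, 30s, and 120s.
for interval in (10, 30, 120):
    rate = checkins_per_second(HOSTS, interval)
    print(f"{interval:>4}s interval -> ~{rate:,.0f} check-ins/sec")
```

Under these assumptions, going from 30 sec to 120 sec cuts the average check-in rate by a factor of four.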
Couple of questions: it seems clear that the check-ins osquery clients do when pinging Fleet to look for distributed queries to run definitely hit Redis. What exactly is the check-in doing, and has this been tuned in more modern versions of Fleet?
Does anyone out there with a sizable osquery / Fleet installation have thoughts on what's a 'best practice' for the distributed interval setting? Do enterprise Fleet users generally never run distributed queries and instead take the time to always build packs, or are people just setting the interval in the couple-of-minutes range and dealing with the delay to get results from ad-hoc UI queries?
04/12/2022, 12:11 AM
Hi Dan, quite a few performance improvements were made since 3.13.0 (btw, I can't find evidence that we published a 3.13.3, are you sure that's the version you are on?)
Clients hit Redis to determine what, if any, live queries they should run.
We've run quite a few more than 40,000 clients on a fairly moderate Redis instance in our recent load tests (https://github.com/fleetdm/fleet/blob/main/CHANGELOG.md#load-test-infrastructure), with the hosts configured on a 10s
Overall, I'd strongly recommend upgrading as a lot has changed in a year!
04/12/2022, 4:02 PM
also, sorry - 3.13.0
@zwass - would creating labels impact Redis, or is that all just at the DB? We do have a lot of labels, and I think the way we create / recreate them is not super efficient currently, so we are thinking about how to do that better.
04/12/2022, 4:29 PM
Labels should be DB only.
04/12/2022, 4:53 PM
One last question, Zach: when we upgrade to 4.something, which we plan to, we will go to a clustered Redis. Is there any rule of thumb on how many primaries to use per X number of machines? Something like 1 primary Redis node per 100,000 machines (for example).