Hello Fleet. We have recently been experiencing performance issues with our AWS ElastiCache Redis instance. We are running Fleet 3.13.3 with just shy of 40,000 nodes registered. Redis engine CPU usage had been high for a while, but then it started cycling: spiking to 100% for ~4 min, dropping to ~60% for ~3 min, then back up. This caused the Fleet UI to behave very erratically.
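For anyone debugging something similar, a quick way to confirm what's hammering Redis is to watch command throughput directly with redis-cli (a diagnostic sketch; the endpoint hostname below is a placeholder for your ElastiCache endpoint):

```
# Live ops/sec, connected clients, and memory, refreshed every second
redis-cli -h my-cluster.xxxxxx.cache.amazonaws.com --stat

# Per-command call counts and cumulative CPU time; look for commands
# with call counts growing in step with your node check-in rate
redis-cli -h my-cluster.xxxxxx.cache.amazonaws.com info commandstats
```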
We traced the issue to our distributed interval, which was set to 30 seconds. Once we increased it to 120, Redis engine CPU dropped to basically 0 (just under 1%). It seems clear that 40,000 nodes checking in every 30 seconds was just too much for Redis to handle, even though we had scaled up to a very large instance (cache.r6g.16xlarge) while troubleshooting.
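For reference, the knob in question is osquery's `distributed_interval` option. We manage it through Fleet's agent options, but for a standalone agent it maps to flags roughly like this (excerpt of a flagfile; the hostname is a placeholder):

```
# osquery flagfile (excerpt) - distributed query plumbing pointed at Fleet
--tls_hostname=fleet.example.com
--distributed_plugin=tls
--distributed_tls_read_endpoint=/api/v1/osquery/distributed/read
--distributed_tls_write_endpoint=/api/v1/osquery/distributed/write
# How often each agent polls Fleet for pending live queries, in seconds.
# Raising this from 30 to 120 is what dropped our Redis CPU to ~1%.
--distributed_interval=120
```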
Couple of questions: it seems clear that the check-ins osquery clients do when pinging Fleet to look for distributed queries to run definitely hit Redis. What exactly is the check-in doing, and has this been tuned in more recent versions of Fleet?
Anyone out there with sizable osquery/Fleet installations have thoughts on a best-practice value for the distributed interval? Do enterprise Fleet users generally avoid ad-hoc distributed queries and take the time to build scheduled packs instead, or are people just setting the interval in the couple-of-minutes range and accepting the delay in getting results from ad-hoc UI queries?
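For context on the packs alternative: in plain osquery config, a scheduled pack looks something like the sketch below (the table, interval, and pack name are illustrative, not a recommendation), so results are collected on the agents' own schedule rather than fanned out through Redis on demand:

```json
{
  "queries": {
    "listening_ports": {
      "query": "SELECT pid, port, address FROM listening_ports;",
      "interval": 3600,
      "description": "Snapshot of listening sockets, collected hourly"
    }
  }
}
```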
Thanks!