# fleet
f
Hi, guys. As we continue with the load tests we're running in our labs, we are still banging our heads against similar problems: in our setup, with a single 2-core, vanilla Fleet 3.10 server fronted by a load balancer and with separate MySQL and Redis servers, we are able to reach between 2,000 and 4,000 hosts depending on the config. All hosts are Docker-simulated. Everything appears to run fine for some time and even the main stats of the Fleet server look great (<5% CPU consumption, etc.), but at some point the CPU skyrockets to 100%, the number of database connections increases too, and eventually the whole system goes down.
The question / ask is the following: are there any config tips or best practices that would allow us to 1) increase the overall performance of Fleet, and 2) even more importantly, gauge or model the expected behavior for different specs of a Fleet deployment (e.g., what limits can we expect for an 8-core server)?
Anything you can share would be extremely appreciated and valuable!
The errors we are getting look like this:
Apr 15 10:16:26 devo-ua-manager fleet[8521]: 2021/04/15 10:15:32 http: TLS handshake error from 10.1.22.17:15104: EOF
z
Could you please create a debug archive while the server is in this high-CPU state? https://github.com/fleetdm/fleet/blob/master/docs/1-Using-Fleet/5-Monitoring-Fleet.md#generate-debug-archive-fleet-340
EOF errors seem to happen sometimes when the ulimit is exceeded. Have you set ulimits appropriately?
Are the Docker hosts running on the same physical/virtual machine that the Fleet server is running on?
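On the ulimit point, a minimal standalone Go sketch (not part of Fleet, assuming a Linux host) that prints the open-file limit the process actually sees; useful for confirming that a raised ulimit really applies to the Fleet process rather than just to your shell:

```go
// rlimit_check.go: illustrative sketch, not Fleet code. Prints the
// RLIMIT_NOFILE soft/hard limits seen by the current process (Linux/macOS).
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		fmt.Println("getrlimit failed:", err)
		return
	}
	// Each connected host holds at least one TCP connection plus TLS state,
	// so the soft limit should comfortably exceed the number of enrolled hosts.
	fmt.Printf("open files: soft=%d hard=%d\n", rl.Cur, rl.Max)
}
```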
f
As always, thanks for the quick turnaround, @zwass! We will create a debug archive and share it here. Thanks for the pointer.
On the other two things: ulimits were something we looked at, but we ruled them out after increasing the limits.
The Docker hosts run on separate machines from the Fleet server.
I guess my original question can be stated a different way: what are the first signs that might suggest a Fleet installation is not properly sized? What we see is that there is virtually no difference between a system that is behaving perfectly and that same system collapsing an hour later.
j
hi zwass!! I work with Francisco Huerta
It is difficult to get the debug files: as Francisco said, the Fleet CPU usage skyrockets to 100%, so the server is completely busy and won't let you log in with fleetctl.
I was able to get a debug file that was started right before the issue, just before Fleet went to 100% CPU.
I am sending you another file taken one minute before Fleet hit 100% CPU.
z
Check out the flame graph from that CPU profile
It's spending the vast majority of CPU time on negotiating the TLS connection
Compare with this profile from my local machine with 1,000 simulated hosts (via https://github.com/fleetdm/osquery-perf)
Then I did a little digging and found https://github.com/golang/go/issues/20058
I wonder if you could generate a new TLS cert as described in https://github.com/traefik/traefik/issues/2673#issuecomment-472348299 and/or terminate TLS with something else to see whether this is connected to that TLS issue.
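For anyone wanting to try the ECDSA route in a lab without a CA handy, here is a rough Go sketch that generates a self-signed P-256 certificate and key; the hostname, validity period, and output filenames are placeholders, and in production you would use a cert issued by your real CA and point Fleet's TLS certificate/key settings at it:

```go
// gen_ecdsa_cert.go: rough sketch of generating a self-signed ECDSA (P-256)
// certificate and key for lab testing. Not Fleet code; all names are examples.
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"math/big"
	"os"
	"time"
)

func main() {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}

	tmpl := x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "fleet.example.com"}, // placeholder hostname
		DNSNames:     []string{"fleet.example.com"},               // placeholder hostname
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(365 * 24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}

	der, err := x509.CreateCertificate(rand.Reader, &tmpl, &tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		panic(err)
	}

	// Write PEM files to hand to the TLS-terminating server.
	if err := os.WriteFile("fleet.crt", pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), 0600); err != nil {
		panic(err)
	}
	if err := os.WriteFile("fleet.key", pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER}), 0600); err != nil {
		panic(err)
	}
}
```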
j
thanks! We'll try it
f
Thanks @zwass, that seems like a way forward. We'll keep everyone posted on the progress.
z
Yeah, hope it works! This is the first time I've seen an issue like you are describing so it's definitely a learning experience.
f
So here comes an update: long story short, you seem to have put us on the right path! After changing the certs from RSA to ECDSA there is significant relief in terms of CPU consumption for the TLS handling, and indeed we've been able to double from 2,000 endpoints to 4,000 using the same baseline configuration.
More testing ahead, but this seems worth looking into for anyone experiencing similar issues with TLS negotiation.
Huge THANK YOU, @zwass! 🙌
Item 2 above now shows the bulk of the CPU consumption, which corresponds to the data processing (JSON marshalling, etc.), and that is the result we would expect in normal conditions.
Item 1, for its part, is the portion of CPU related to TLS handling, which is now much smaller than at our previous starting point.
All in all, very very promising
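As a rough illustration of why the cert swap helps (and of the Go issue linked above), the micro-benchmark below compares a single RSA-2048 signature against an ECDSA P-256 one, which approximates the per-handshake signing work the server does for each certificate type; numbers vary by machine, so treat it as a sketch rather than a measurement of Fleet itself:

```go
// sign_bench_test.go: illustrative micro-benchmark comparing the signing cost
// of RSA-2048 vs ECDSA P-256, roughly the work the server performs per TLS
// handshake with the respective certificate. Run with:
//   go test -bench . sign_bench_test.go
package main

import (
	"crypto"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"testing"
)

// A fixed digest standing in for the handshake transcript hash.
var digest = sha256.Sum256([]byte("tls handshake transcript"))

func BenchmarkRSA2048Sign(b *testing.B) {
	key, _ := rsa.GenerateKey(rand.Reader, 2048)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := rsa.SignPKCS1v15(rand.Reader, key, crypto.SHA256, digest[:]); err != nil {
			b.Fatal(err)
		}
	}
}

func BenchmarkECDSAP256Sign(b *testing.B) {
	key, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := ecdsa.SignASN1(rand.Reader, key, digest[:]); err != nil {
			b.Fatal(err)
		}
	}
}
```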
j
Thanks a lot, @zwass, for your advice!! As Francisco said, the system is working much better now.
z
Yes that looks much much better!
Glad to hear it.
I'll add something to our documentation about this.