# fleet
f
Hi, guys. As we continue with the load tests we're running in our labs, we are still banging our heads against similar problems: in our setup, with a single 2-core, vanilla Fleet 3.10 server fronted by a load balancer and with separate MySQL and Redis servers, we are able to reach between 2,000 and 4,000 hosts depending on the config. All hosts are Docker-simulated. Everything appears to run fine for some time and even the main stats of the Fleet server look great (<5% CPU consumption, etc.), but at some point the CPU skyrockets to 100%, the number of database connections increases too, and eventually the whole system goes down.
The question / ask is the following: are there any config tips or best practices that would allow us to 1) increase the overall performance of Fleet, and 2) even more importantly, gauge or model the expected behavior for different specs of a Fleet deployment (e.g., what limits can we expect for an 8-core server)?
Anything you can share would be extremely appreciated and valuable!
The errors we are getting look like this:
Apr 15 10:16:26 devo-ua-manager fleet[8521]: 2021/04/15 10:15:32 http: TLS handshake error from 10.1.22.17:15104: EOF
z
Could you please create a debug archive while the server is in this high-CPU state? https://github.com/fleetdm/fleet/blob/master/docs/1-Using-Fleet/5-Monitoring-Fleet.md#generate-debug-archive-fleet-340
EOF errors seem to happen sometimes when the ulimit is exceeded. Have you set ulimits appropriately?
Are the Docker hosts running on the same physical/virtual machine that the Fleet server is running on?
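On the ulimit point, a minimal standalone Go sketch (not part of Fleet, assuming a Linux host) that prints the open-file limit the process actually sees; useful for confirming that a raised ulimit really applies to the Fleet process rather than just to your shell:

```go
// rlimit_check.go: illustrative sketch, not Fleet code. Prints the
// RLIMIT_NOFILE soft/hard limits seen by the current process (Linux/macOS).
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		fmt.Println("getrlimit failed:", err)
		return
	}
	// Each connected host holds at least one TCP connection plus TLS state,
	// so the soft limit should comfortably exceed the number of enrolled hosts.
	fmt.Printf("open files: soft=%d hard=%d\n", rl.Cur, rl.Max)
}
```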
f
As always, thanks for the quick turnaround, @zwass! We will create a debug archive and share it here. Thanks for the pointer.
On the other two things: ulimits were something we looked at, but we ruled them out after increasing the limits.
The Docker hosts run on separate machines from the Fleet server.
I guess my original question can be stated a different way: what are the first signs that might suggest a Fleet installation is not properly sized? What we see is that there is virtually no difference between a system that is behaving perfectly and that same system collapsing an hour later.
j
hi zwass!! I work with Francisco Huerta
It is difficult to get the debug files: as Francisco said, the Fleet CPU usage skyrockets to 100%, so the server is completely busy and won't let you log in with fleetctl.
I was able to get a debug file that was started right before the issue, just before Fleet went to 100% CPU.
I am sending you another file taken one minute before Fleet hit 100% CPU.
z
Check out the flame graph from that CPU profile
It's spending the vast majority of CPU time on negotiating the TLS connection
Compare with this profile from my local machine with 1,000 simulated hosts (via https://github.com/fleetdm/osquery-perf)
Then I did a little digging and found https://github.com/golang/go/issues/20058
I wonder if you could generate a new TLS cert as described in https://github.com/traefik/traefik/issues/2673#issuecomment-472348299 and/or terminate TLS with something else to see whether this is connected to that TLS issue.
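For anyone wanting to try the ECDSA route in a lab without a CA handy, here is a rough Go sketch that generates a self-signed P-256 certificate and key; the hostname, validity period, and output filenames are placeholders, and in production you would use a cert issued by your real CA and point Fleet's TLS certificate/key settings at it:

```go
// gen_ecdsa_cert.go: rough sketch of generating a self-signed ECDSA (P-256)
// certificate and key for lab testing. Not Fleet code; all names are examples.
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"math/big"
	"os"
	"time"
)

func main() {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}

	tmpl := x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "fleet.example.com"}, // placeholder hostname
		DNSNames:     []string{"fleet.example.com"},               // placeholder hostname
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(365 * 24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}

	der, err := x509.CreateCertificate(rand.Reader, &tmpl, &tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		panic(err)
	}

	// Write PEM files to hand to the TLS-terminating server.
	if err := os.WriteFile("fleet.crt", pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), 0600); err != nil {
		panic(err)
	}
	if err := os.WriteFile("fleet.key", pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER}), 0600); err != nil {
		panic(err)
	}
}
```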
j
thanks! We'll try it
f
Thanks @zwass, that seems like a way forward. We'll keep everyone posted on the progress.
z
Yeah, hope it works! This is the first time I've seen an issue like you are describing so it's definitely a learning experience.
f
So here comes an update: long story short, you seem to have put us on the right path! After changing the certs from RSA to ECDSA there is significant relief in terms of CPU consumption for the TLS handling, and indeed we've been able to double from 2,000 endpoints to 4,000 using the same baseline configuration.
More testing ahead, but this seems worth looking into for anyone experiencing similar issues with TLS negotiation.
Huge THANK YOU, @zwass! 🙌
Item 2 above now shows the bulk of the CPU consumption, which corresponds to the data processing (JSON marshalling, etc.), and that is the result we would expect in normal conditions.
Item 1, for its part, is the portion of CPU related to TLS handling, which is now much smaller than at our previous starting point.
All in all, very very promising
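As a rough illustration of why the cert swap helps (and of the Go issue linked above), the micro-benchmark below compares a single RSA-2048 signature against an ECDSA P-256 one, which approximates the per-handshake signing work the server does for each certificate type; numbers vary by machine, so treat it as a sketch rather than a measurement of Fleet itself:

```go
// sign_bench_test.go: illustrative micro-benchmark comparing the signing cost
// of RSA-2048 vs ECDSA P-256, roughly the work the server performs per TLS
// handshake with the respective certificate. Run with:
//   go test -bench . sign_bench_test.go
package main

import (
	"crypto"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"testing"
)

// A fixed digest standing in for the handshake transcript hash.
var digest = sha256.Sum256([]byte("tls handshake transcript"))

func BenchmarkRSA2048Sign(b *testing.B) {
	key, _ := rsa.GenerateKey(rand.Reader, 2048)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := rsa.SignPKCS1v15(rand.Reader, key, crypto.SHA256, digest[:]); err != nil {
			b.Fatal(err)
		}
	}
}

func BenchmarkECDSAP256Sign(b *testing.B) {
	key, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := ecdsa.SignASN1(rand.Reader, key, digest[:]); err != nil {
			b.Fatal(err)
		}
	}
}
```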
j
Thanks a lot, @zwass, for your advice!! As Francisco said, the system is working much better now.
z
Yes that looks much much better!
Glad to hear it.
I'll add something to our documentation about this.