#fleet

Juan Alvarez

09/21/2021, 11:23 AM
Hi all, I have noticed that CPU consumption on the FleetDM side is way lower when I put an LB (AWS ALB in this case) between FleetDM and osquery. Has anybody noticed the same behavior? The LB does not terminate TLS, so I am trying to understand why CPU consumption differs so much.
Tomas Touceda

09/21/2021, 1:39 PM
hi Juan, that's very odd. Could the LB be adding some delay to the requests, making the load less intense on the fleet side? Could you share your fleet configs so that we can see if there's anything that could cause high CPU usage?

Juan Alvarez

09/21/2021, 2:45 PM
I think all the configs are pretty normal. I have some custom configs, mostly related to a logger plugin that we developed, which I removed here for clarity. I am using FleetDM 3.13.
FLEET_AUTH_JWT_KEY='******************************'
FLEET_REDIS_ADDRESS='localhost:6379'
FLEET_REDIS_PASSWORD=''
FLEET_REDIS_DATABASE='0'
FLEET_MYSQL_ADDRESS='localhost:3306'
FLEET_MYSQL_DATABASE='*****'
FLEET_MYSQL_USERNAME='*****'
FLEET_MYSQL_PASSWORD='*****'
FLEET_MYSQL_MAX_IDLE_CONNS='50'
FLEET_MYSQL_MAX_OPEN_CONNS='100'
FLEET_SERVER_CERT=/etc/devo-ea-manager/certs/devo-ea-manager.crt
FLEET_SERVER_KEY=/etc/devo-ea-manager/certs/devo-ea-manager.key
FLEET_LOGGING_JSON=true
FLEET_OSQUERY_RESULT_LOG_PLUGIN=devo
FLEET_OSQUERY_STATUS_LOG_PLUGIN=devo
FLEET_OSQUERY_LABEL_UPDATE_INTERVAL='1h'
Tomas Touceda

09/21/2021, 2:55 PM
gotcha, not sure with that version, there have been a lot of changes since then

Juan Alvarez

09/21/2021, 3:04 PM
I am running everything on the same box (an AWS c5.large), with Redis and MySQL running locally in Docker. Without an LB I get 50% CPU on average with 1.4k endpoints, but with the LB CPU averages under 20%
Tomas Touceda

09/21/2021, 3:09 PM
I'll ask around on my end to see if any ideas come up. Could you share logs for fleet without LB and with LB?

Juan Alvarez

09/21/2021, 3:13 PM
There are not really logs to share, since fleet is not throwing any errors
3:15 PM
Another piece of info is that we use the fleet server to send data to our SIEM via the custom logger we made. So we use fleet for config, distributed queries, and log submission as well
3:15 PM
Most of the traffic comes from that last one. We only read configs every 900 seconds and distributed queries every minute
Tomas Touceda

09/21/2021, 3:50 PM
> There are not really logs to share, since fleet is not throwing any errors
well, my goal was to compare timestamps on some endpoints to see if there could be a slowdown happening
3:51 PM
roughly every minute might be too often. Have you considered upping those intervals and seeing how CPU looks?
zwass

09/21/2021, 4:09 PM
IIRC there was something with the Go stdlib TLS that caused high CPU consumption in some older versions of Fleet when terminating TLS. Let me see if I can dig that up.
4:24 PM
Anyone experiencing similar issues should generate an ECDSA certificate to alleviate performance problems with Go's TLS termination.
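For anyone wanting to try the ECDSA suggestion above, here is a minimal Go sketch that generates a self-signed ECDSA (P-256) certificate and key using only the standard library. The host name `fleet.example.com` and the helper names `newECDSACert` and `certKeyAlgorithm` are made up for illustration; this is not Fleet code, and a real deployment would more likely use a CA-issued certificate.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// newECDSACert returns a PEM-encoded self-signed certificate and the matching
// P-256 private key. The host argument is a placeholder chosen by the caller.
func newECDSACert(host string) (certPEM, keyPEM []byte, err error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	// Random serial number below 2^127.
	serial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 127))
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          serial,
		Subject:               pkix.Name{CommonName: host},
		DNSNames:              []string{host},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(365 * 24 * time.Hour),
		KeyUsage:              x509.KeyUsageDigitalSignature,
		ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		BasicConstraintsValid: true,
	}
	// Self-signed: the template is used as both subject and issuer.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return nil, nil, err
	}
	certPEM = pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyPEM = pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	return certPEM, keyPEM, nil
}

// certKeyAlgorithm parses a PEM certificate and reports its public key algorithm.
func certKeyAlgorithm(certPEM []byte) (string, error) {
	block, _ := pem.Decode(certPEM)
	if block == nil {
		return "", fmt.Errorf("no PEM block found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return "", err
	}
	return cert.PublicKeyAlgorithm.String(), nil
}

func main() {
	certPEM, keyPEM, err := newECDSACert("fleet.example.com")
	if err != nil {
		panic(err)
	}
	algo, err := certKeyAlgorithm(certPEM)
	if err != nil {
		panic(err)
	}
	fmt.Printf("algorithm=%s cert=%d bytes key=%d bytes\n", algo, len(certPEM), len(keyPEM))
}
```

The two PEM outputs would then be written to the files that `FLEET_SERVER_CERT` and `FLEET_SERVER_KEY` point at.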

Juan Alvarez

09/22/2021, 7:21 AM
We already use ECDSA, so I don't think this is related. What I am trying to understand is why the CPU consumption would drop when adding an LB to the same scenario.
mikermcneil

09/23/2021, 3:48 AM
Random thought: Could osquery be leaving TCP connections hanging in such a way that resources get eaten up on the Fleet server for a while even after the HTTP response is sent? If so, then maybe the LB is doing some TCP cleanup (or maybe timing out long HTTP requests more quickly) that makes the problem go away?
zwass

09/23/2021, 3:39 PM
@Juan Alvarez we can dig into this further if you provide us some debug archives. Please see https://fleetdm.com/docs/using-fleet/monitoring-fleet#generate-debug-archive-fleet-3-4-0. Can you make one archive from the situation with the LB and one with Fleet terminating TLS please?

Juan Alvarez

09/23/2021, 3:42 PM
@mikermcneil that was my thought too. We have seen many connections left in TIME_WAIT status without being reused, and there was some improvement when setting --tls_session_reuse to false. @zwass yes, I will do so, but I need some time to rebuild environments for my testing. Will come back to this thread with the info.
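One rough way to put a number on the TIME_WAIT observation above, assuming a Linux host: count sockets whose state column in `/proc/net/tcp` is `06`, the kernel's code for TIME_WAIT. The helper name `countTCPState` is made up for this sketch, and it only covers IPv4 (`/proc/net/tcp6` has the same layout for IPv6).

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// stateTimeWait is the hex state code Linux uses for TIME_WAIT in /proc/net/tcp.
const stateTimeWait = "06"

// countTCPState counts entries in /proc/net/tcp-formatted text whose "st"
// column (the fourth whitespace-separated field) equals the given state code.
func countTCPState(procNetTCP, state string) int {
	n := 0
	sc := bufio.NewScanner(strings.NewReader(procNetTCP))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// The header line has "st" in this position, so it never matches a hex code.
		if len(fields) > 3 && fields[3] == state {
			n++
		}
	}
	return n
}

func main() {
	data, err := os.ReadFile("/proc/net/tcp")
	if err != nil {
		fmt.Println("could not read /proc/net/tcp:", err)
		return
	}
	fmt.Println("TIME_WAIT sockets:", countTCPState(string(data), stateTimeWait))
}
```

Running this on the Fleet box with and without the LB in front would show whether the number of lingering connections actually differs between the two setups.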
zwass

09/23/2021, 3:52 PM
Thank you! We can also get some sense of how much memory is allocated to connections from that debug archive.

Juan Alvarez

09/24/2021, 3:12 PM
I have captured 3 snapshots of each case, and prefixed them as USINGLB and NOLB. This is an AWS c5.large instance with 1.4k endpoints running in both cases.
3:13 PM
I also confirmed the same behavior: CPU is around 20% with the LB and around 50% without it.
Tomas Touceda

09/24/2021, 3:19 PM
oh, the LB is terminating TLS instead of fleet

Juan Alvarez

09/24/2021, 3:26 PM
it should not be, is it?
3:27 PM
i thought Fleet would only take TLS traffic?
Tomas Touceda

09/24/2021, 3:28 PM
not necessarily, osquery is the one that requires TLS. If you have an LB, the LB terminates TLS and then distributes the traffic across the different fleet instances

Juan Alvarez

09/24/2021, 3:34 PM
oh, then I misunderstood completely. I was under the impression that traffic to fleet was always TLS. I just created an AWS HTTPS LB and set the certificate there, but I did not think that the traffic would be in the clear from there on, or that Fleet would accept it 😒
Tomas Touceda

09/24/2021, 3:54 PM
gotcha, yeah, it's a pretty standard practice to terminate TLS at the LB, and then not worry about encryption from that point on, as it's already within your network
zwass

09/24/2021, 4:38 PM
Since the Fleet server can terminate TLS you have 3 options:
1) Terminate at the LB and HTTP LB -> Fleet
2) Terminate at LB and reencrypt HTTPS LB -> Fleet
3) Pass through the LB and terminate at Fleet
1 and 2 are probably the most common
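To make the difference between these options concrete from the backend's point of view, here is a small Go sketch using only `net/http` and `net/http/httptest`. It is a generic illustration, not Fleet code, and `backendSawTLS` is a made-up helper. With plain HTTP behind the LB (option 1) the backend never performs a TLS handshake, while in options 2 and 3 the backend itself terminates TLS and pays that CPU cost.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// backendSawTLS starts a throwaway backend, sends it one request, and reports
// whether the backend observed TLS on the connection it received.
func backendSawTLS(backendTerminatesTLS bool) (bool, error) {
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// r.TLS is nil when the request arrived over plain HTTP.
		fmt.Fprintf(w, "%t", r.TLS != nil)
	})

	var srv *httptest.Server
	if backendTerminatesTLS {
		srv = httptest.NewTLSServer(handler) // like options 2 and 3
	} else {
		srv = httptest.NewServer(handler) // like option 1: LB -> backend over HTTP
	}
	defer srv.Close()

	// srv.Client() is preconfigured to trust the test server's certificate.
	resp, err := srv.Client().Get(srv.URL)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	return string(body) == "true", nil
}

func main() {
	for _, terminates := range []bool{false, true} {
		saw, err := backendSawTLS(terminates)
		if err != nil {
			panic(err)
		}
		fmt.Printf("backend terminates TLS=%t, backend saw TLS=%t\n", terminates, saw)
	}
}
```

The same `r.TLS != nil` check in a request handler is a quick way to confirm which option a given deployment is actually using.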

Juan Alvarez

09/24/2021, 4:52 PM
Thanks guys, I am trying to open the results myself, but for some reason I can't get pprof to work... So you are saying that what you see in the capture is situation 1: right now the traffic from LB --> Fleet is in the clear, and hence CPU consumption is lower. Is that right?
zwass

09/24/2021, 5:08 PM
go tool pprof -http localhost:8081 profile
is what I use
5:08 PM
(go version: 1.17)

Juan Alvarez

09/24/2021, 5:10 PM
oh yes, that worked 😄
5:12 PM
ok, i can see that there is no TLS
5:13 PM
well, thank you all for the help. I had a big misunderstanding and did not realize that the LB was actually terminating TLS