# fleet
j
Hi all, I have noticed that CPU consumption on the FleetDM side is way lower when I put an LB (an AWS ALB in this case) between FleetDM and osquery. Has anybody noticed the same behavior? The LB does not terminate TLS, so I am trying to understand why CPU consumption differs so much.
t
hi Juan, that's very odd. Does the LB potentially add some delay to the requests, making the load less intense on the Fleet side? Could you share your Fleet configs so that we can see if there's anything that could cause high CPU usage?
j
I think all the configs are pretty normal. I have removed some custom settings, mostly related to a logger plugin we developed (removed for clarity). I am using FleetDM 3.13.
```
FLEET_AUTH_JWT_KEY='******************************'
FLEET_REDIS_ADDRESS='localhost:6379'
FLEET_REDIS_PASSWORD=''
FLEET_REDIS_DATABASE='0'
FLEET_MYSQL_ADDRESS='localhost:3306'
FLEET_MYSQL_DATABASE='*****'
FLEET_MYSQL_USERNAME='*****'
FLEET_MYSQL_PASSWORD='*****'
FLEET_MYSQL_MAX_IDLE_CONNS='50'
FLEET_MYSQL_MAX_OPEN_CONNS='100'
FLEET_SERVER_CERT=/etc/devo-ea-manager/certs/devo-ea-manager.crt
FLEET_SERVER_KEY=/etc/devo-ea-manager/certs/devo-ea-manager.key
FLEET_LOGGING_JSON=true
FLEET_OSQUERY_RESULT_LOG_PLUGIN=devo
FLEET_OSQUERY_STATUS_LOG_PLUGIN=devo
FLEET_OSQUERY_LABEL_UPDATE_INTERVAL='1h'
```
t
gotcha, I'm not sure about that version; there have been a lot of changes since then
j
I am running everything on the same box (an AWS c5.large), with Redis and MySQL running locally in Docker. Without an LB I get 50% CPU on average with 1.4k endpoints, but with the LB CPU averages under 20%.
t
I'll ask around on my end to see if any ideas come up. Could you share logs for fleet without LB and with LB?
j
There aren't really logs to share, since Fleet is not throwing any errors.
Another piece of info: we use the Fleet server to send data to our SIEM via the custom logger we made. So we use Fleet for the config, distributed, and log submission endpoints as well.
Most of the traffic is that last part. We only read configs every 900 seconds and poll distributed queries every minute.
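For context, the osquery side looks roughly like this (these are the standard osquery TLS flags; the logger period shown is osquery's default, not necessarily our exact value):
```
# osquery flagfile sketch matching the intervals described above
--config_refresh=900        # pull config from Fleet every 900 seconds
--distributed_interval=60   # poll for distributed queries every 60 seconds
--logger_tls_period=10      # ship result/status logs to Fleet (10 is the osquery default)
```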
t
> There aren't really logs to share, since Fleet is not throwing any errors.
well, my goal was to compare timestamps on some endpoints to see if there could be a slowdown happening
polling roughly every minute might be too often; have you considered upping those intervals and seeing how the CPU looks?
z
IIRC there was something with the Go stdlib TLS that caused high CPU consumption in some older versions of Fleet when terminating TLS. Let me see if I can dig that up.
Anyone experiencing similar issues should generate an ECDSA certificate to alleviate performance problems with Go's TLS termination.
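If anyone needs it, an ECDSA key and certificate can be generated with something like this (self-signed here purely as a sketch; substitute your real CA and hostname):
```
# generate a P-256 ECDSA key and a self-signed cert (sketch only)
openssl ecparam -name prime256v1 -genkey -noout -out fleet.key
openssl req -new -x509 -key fleet.key -out fleet.crt -days 365 \
  -subj "/CN=fleet.example.com"
```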
j
We already use ECDSA, so I don't think this is related. What I am trying to understand is why the CPU consumption would drop when adding an LB to the same scenario.
m
Random thought: Could osquery be leaving TCP connections hanging in such a way that resources get eaten up on the Fleet server for a while even after the HTTP response is sent? If so, then maybe the LB is doing some TCP cleanup (or maybe timing out long HTTP requests more quickly) that makes the problem go away?
z
@Juan Alvarez we can dig into this further if you provide us some debug archives. Please see https://fleetdm.com/docs/using-fleet/monitoring-fleet#generate-debug-archive-fleet-3-4-0. Can you make one archive from the situation with the LB and one with Fleet terminating TLS please?
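That's the `fleetctl debug archive` command described there; roughly:
```
# assumes fleetctl is already configured and logged in to the Fleet server
fleetctl debug archive
# writes a tar.gz of pprof profiles and other diagnostics to the current directory
```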
j
@mikermcneil that was my thought too. We have seen many connections lingering in TIME_WAIT without reuse, and there was some improvement when setting --tls_session_reuse to false. @zwass yes, I will do so, but I need some time to rebuild environments during my testing. I will come back to this thread with the info.
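For reference, this is roughly how we were counting the lingering connections (on Linux, with ss from iproute2):
```
# count TCP connections parked in TIME_WAIT on the Fleet host
ss -tan state time-wait | wc -l
```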
z
Thank you! We can also get some sense of how much memory is allocated to connections from that debug archive.
j
I have captured 3 snapshots of each case, prefixed USINGLB and NOLB. This is an AWS c5.large instance with 1.4k endpoints in both cases.
I also confirmed the same behavior: CPU is around 20% with the LB and around 50% without it.
t
oh, the LB is terminating TLS instead of fleet
j
it should not... should it?
I thought Fleet would only accept TLS traffic?
t
not necessarily; it's the osquery leg that is required to be TLS. If you have an LB, the LB terminates TLS and then distributes requests to the different Fleet instances
j
oh, then I misunderstood completely. I was under the impression that traffic to Fleet was always TLS. I just created an AWS HTTPS LB and set the certificate there, but I did not realize that traffic from there on would be cleartext and that Fleet would accept it :s
t
gotcha, yeah, it's a pretty standard practice to terminate TLS at the LB, and then not worry about encryption from that point on, as it's already within your network
z
Since the Fleet server can terminate TLS, you have 3 options:
1) Terminate at the LB, plain HTTP LB -> Fleet
2) Terminate at the LB and re-encrypt, HTTPS LB -> Fleet
3) Pass through the LB and terminate at Fleet
1 and 2 are probably the most common.
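For option 1, the Fleet side would look something like this (FLEET_SERVER_TLS is the env var for Fleet's server.tls setting; setting it to false makes Fleet serve plain HTTP):
```
# option 1 sketch: plain HTTP between the LB and Fleet
FLEET_SERVER_TLS=false
# FLEET_SERVER_CERT / FLEET_SERVER_KEY are not needed in this mode;
# osquery still speaks HTTPS to the LB, so the certificate lives on the LB
```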
j
Thanks guys. I am trying to open the results myself, but for some reason I can't get pprof to work... So you're saying that what you see in the capture is situation 1, and right now the traffic from LB -> Fleet is cleartext, hence the lower CPU consumption, isn't it?
z
```
go tool pprof -http localhost:8081 profile
```
is what I use
(go version: 1.17)
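(if you're opening the debug archive, the CPU profile should be the file named `profile` in there)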
j
oh yes, that worked 😄
ok, I can see that there is no TLS
well, thank you all for the help. I had a big misunderstanding and did not realize that the LB was actually terminating TLS