# fleet
j
Hi all, I have noticed that CPU consumption on the FleetDM side is way lower when I put an LB (an AWS ALB in this case) between FleetDM and osquery. Has anybody noticed the same behavior? The LB does not terminate TLS, so I am trying to understand why CPU consumption differs so much.
t
hi Juan, that's very odd. Does the LB potentially add some delay to the requests, making the load less intense on the Fleet side? Could you share your Fleet configs so that we can see if there's anything that could cause high CPU usage?
j
I think all the configs are pretty normal. I have removed some custom settings, mostly related to a logger plugin we developed (removed for clarity). I am using FleetDM 3.13.
```
FLEET_AUTH_JWT_KEY='******************************'
FLEET_REDIS_ADDRESS='localhost:6379'
FLEET_REDIS_PASSWORD=''
FLEET_REDIS_DATABASE='0'
FLEET_MYSQL_ADDRESS='localhost:3306'
FLEET_MYSQL_DATABASE='*****'
FLEET_MYSQL_USERNAME='*****'
FLEET_MYSQL_PASSWORD='*****'
FLEET_MYSQL_MAX_IDLE_CONNS='50'
FLEET_MYSQL_MAX_OPEN_CONNS='100'
FLEET_SERVER_CERT=/etc/devo-ea-manager/certs/devo-ea-manager.crt
FLEET_SERVER_KEY=/etc/devo-ea-manager/certs/devo-ea-manager.key
FLEET_LOGGING_JSON=true
FLEET_OSQUERY_RESULT_LOG_PLUGIN=devo
FLEET_OSQUERY_STATUS_LOG_PLUGIN=devo
FLEET_OSQUERY_LABEL_UPDATE_INTERVAL='1h'
```
t
gotcha, I'm not sure about that version; there have been a lot of changes since then
j
I am running everything on the same box (an AWS c5.large), with Redis and MySQL running locally in Docker. Without an LB I get 50% CPU on average with 1.4k endpoints, but with the LB CPU averages under 20%.
t
I'll ask around on my end to see if any ideas come up. Could you share logs for fleet without LB and with LB?
j
There aren't really logs to share, since Fleet is not throwing any errors.
Another piece of info: we use the Fleet server to send data to our SIEM via the custom logger we made. So we use Fleet for the config, distributed, and log submission endpoints as well.
Most of the traffic is that last part. We only read configs every 900 seconds and poll distributed queries every minute.
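For context, the osquery side looks roughly like this (these are the standard osquery TLS flags; the logger period shown is osquery's default, not necessarily our exact value):
```
# osquery flagfile sketch matching the intervals described above
--config_refresh=900        # pull config from Fleet every 900 seconds
--distributed_interval=60   # poll for distributed queries every 60 seconds
--logger_tls_period=10      # ship result/status logs to Fleet (10 is the osquery default)
```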
t
> There aren't really logs to share, since Fleet is not throwing any errors.
well, my goal was to compare timestamps on some endpoints to see if there could be a slowdown happening
polling roughly every minute might be too often; have you considered upping those intervals and seeing how the CPU looks?
z
IIRC there was something with the Go stdlib TLS that caused high CPU consumption in some older versions of Fleet when terminating TLS. Let me see if I can dig that up.
Anyone experiencing similar issues should generate an ECDSA certificate to alleviate performance problems with Go's TLS termination.
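If anyone needs it, an ECDSA key and certificate can be generated with something like this (self-signed here purely as a sketch; substitute your real CA and hostname):
```
# generate a P-256 ECDSA key and a self-signed cert (sketch only)
openssl ecparam -name prime256v1 -genkey -noout -out fleet.key
openssl req -new -x509 -key fleet.key -out fleet.crt -days 365 \
  -subj "/CN=fleet.example.com"
```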
j
We already use ECDSA, so I don't think this is related. What I am trying to understand is why the CPU consumption would drop when adding an LB to the same scenario.
m
Random thought: Could osquery be leaving TCP connections hanging in such a way that resources get eaten up on the Fleet server for a while even after the HTTP response is sent? If so, then maybe the LB is doing some TCP cleanup (or maybe timing out long HTTP requests more quickly) that makes the problem go away?
z
@Juan Alvarez we can dig into this further if you provide us some debug archives. Please see https://fleetdm.com/docs/using-fleet/monitoring-fleet#generate-debug-archive-fleet-3-4-0. Can you make one archive from the situation with the LB and one with Fleet terminating TLS please?
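That's the `fleetctl debug archive` command described there; roughly:
```
# assumes fleetctl is already configured and logged in to the Fleet server
fleetctl debug archive
# writes a tar.gz of pprof profiles and other diagnostics to the current directory
```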
j
@mikermcneil that was my thought too. We have seen many connections lingering in TIME_WAIT without reuse, and there was some improvement when setting --tls_session_reuse to false. @zwass yes, I will do so, but I need some time to rebuild environments during my testing. I will come back to this thread with the info.
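For reference, this is roughly how we were counting the lingering connections (on Linux, with ss from iproute2):
```
# count TCP connections parked in TIME_WAIT on the Fleet host
ss -tan state time-wait | wc -l
```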
z
Thank you! We can also get some sense of how much memory is allocated to connections from that debug archive.
j
I have captured 3 snapshots of each case, prefixed USINGLB and NOLB. This is an AWS c5.large instance with 1.4k endpoints in both cases.
I also confirmed the same behavior: CPU is around 20% with the LB and around 50% without it.
t
oh, the LB is terminating TLS instead of fleet
j
it should not... should it?
I thought Fleet would only accept TLS traffic?
t
not necessarily; it's the osquery leg that is required to be TLS. If you have an LB, the LB terminates TLS and then distributes requests to the different Fleet instances
j
oh, then I misunderstood completely. I was under the impression that traffic to Fleet was always TLS. I just created an AWS HTTPS LB and set the certificate there, but I did not realize that traffic from there on would be cleartext and that Fleet would accept it :s
t
gotcha, yeah, it's a pretty standard practice to terminate TLS at the LB, and then not worry about encryption from that point on, as it's already within your network
z
Since the Fleet server can terminate TLS, you have 3 options:
1) Terminate at the LB, plain HTTP LB -> Fleet
2) Terminate at the LB and re-encrypt, HTTPS LB -> Fleet
3) Pass through the LB and terminate at Fleet
1 and 2 are probably the most common.
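For option 1, the Fleet side would look something like this (FLEET_SERVER_TLS is the env var for Fleet's server.tls setting; setting it to false makes Fleet serve plain HTTP):
```
# option 1 sketch: plain HTTP between the LB and Fleet
FLEET_SERVER_TLS=false
# FLEET_SERVER_CERT / FLEET_SERVER_KEY are not needed in this mode;
# osquery still speaks HTTPS to the LB, so the certificate lives on the LB
```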
j
Thanks guys. I am trying to open the results myself, but for some reason I can't get pprof to work... So you're saying that what you see in the capture is situation 1, and right now the traffic from LB -> Fleet is cleartext, hence the lower CPU consumption, isn't it?
z
```
go tool pprof -http localhost:8081 profile
```
is what I use
(go version: 1.17)
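(if you're opening the debug archive, the CPU profile should be the file named `profile` in there)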
j
oh yes, that worked 😄
ok, I can see that there is no TLS
well, thank you all for the help. I had a big misunderstanding and did not realize that the LB was actually terminating TLS