# fleet
k
👋 I've been having a chronic issue for a while with the software page in self-hosted Fleet. It never loads completely, and our RDS instance runs up to 100% CPU usage. I recently upgraded the instance from `db.t3.medium` to `db.t4g.large` and am still seeing performance issues. I temporarily bumped it up to a `db.r5.xlarge`, and while the CPU didn't max out, the page still spun. We have fewer than 300 hosts, so I'm not sure how complex the query that's running is. Any ideas on how to trace down what's causing the slowness and how to address it?
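(For anyone else chasing this kind of symptom: one way to see what the Fleet server is actually running against MySQL while the page spins is to watch the active queries from the RDS side; enabling the slow query log or Performance Insights works too. The hostname and credentials below are placeholders for your own instance.)
```
# List the queries currently executing against the Fleet database, longest-running first
mysql -h <rds-endpoint> -u fleet -p -e \
  "SELECT id, time, state, LEFT(info, 120) AS query
   FROM information_schema.processlist
   WHERE command <> 'Sleep'
   ORDER BY time DESC;"
```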
Taking a closer look here, it appears the software is loading, but the `count` API call just will not get a response back. The only log I see is a `context cancelled` after I close the tab.
Something is triggering this panic, but the Orbit piece makes me think it's unrelated.
```
2023/02/09 05:05:43 http: panic serving 10.6.95.127:56528: runtime error: invalid memory address or nil pointer dereference
goroutine 5284 [running]:
net/http.(*conn).serve.func1()
	net/http/server.go:1850 +0xbf
panic({0x1b3cbc0, 0x3022060})
	runtime/panic.go:890 +0x262
github.com/fleetdm/fleet/v4/server/datastore/mysql.(*Datastore).LoadHostByOrbitNodeKey(0x11008a4?, {0x22a4058, 0xc001285980}, {0xc001438680, 0x20})
	github.com/fleetdm/fleet/v4/server/datastore/mysql/hosts.go:1299 +0x2ef
github.com/fleetdm/fleet/v4/server/service.(*Service).AuthenticateOrbitHost(0xc0005a7800, {0x22a4058, 0xc001285980}, {0xc001438680, 0x20})
	github.com/fleetdm/fleet/v4/server/service/orbit.go:88 +0x85
github.com/fleetdm/fleet/v4/server/service.authenticatedOrbitHost.func1({0x22a4058, 0xc001285980}, {0x1b4dce0, 0xc000588e70})
	github.com/fleetdm/fleet/v4/server/service/endpoint_middleware.go:132 +0xb7
github.com/fleetdm/fleet/v4/server/service.logged.func1({0x22a4058, 0xc001285980}, {0x1b4dce0?, 0xc000588e70?})
	github.com/fleetdm/fleet/v4/server/service/endpoint_middleware.go:225 +0x35
github.com/fleetdm/fleet/v4/server/service/middleware/authzcheck.(*Middleware).AuthzCheck.func1.1({0x22a4058, 0xc001285920}, {0x1b4dce0, 0xc000588e70})
	github.com/fleetdm/fleet/v4/server/service/middleware/authzcheck/authzcheck.go:31 +0xa2
github.com/go-kit/kit/transport/http.Server.ServeHTTP({0xc000ae91f0, 0xc000a6cbb8, 0x1f07388, {0xc000ab9b60, 0x3, 0x4}, {0xc000b091a0, 0x4, 0x6}, 0x1f07380, ...}, ...)
	github.com/go-kit/kit@v0.12.0/transport/http/server.go:121 +0x35b
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerRequestSize.func2({0x7fb47c4610f8?, 0xc001280780?}, 0xc001283000)
	github.com/prometheus/client_golang@v1.13.0/prometheus/promhttp/instrument_server.go:245 +0x77
net/http.HandlerFunc.ServeHTTP(0x7fb47c4610f8?, {0x7fb47c4610f8?, 0xc001280780?}, 0xc001285530?)
	net/http/server.go:2109 +0x2f
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerResponseSize.func1({0x7fb47c4610f8?, 0xc001280730?}, 0xc001283000)
	github.com/prometheus/client_golang@v1.13.0/prometheus/promhttp/instrument_server.go:284 +0xc5
net/http.HandlerFunc.ServeHTTP(0x22a1a20?, {0x7fb47c4610f8?, 0xc001280730?}, 0x203000?)
	net/http/server.go:2109 +0x2f
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1({0x22a1a20?, 0xc00123a7e0?}, 0xc001283000)
	github.com/prometheus/client_golang@v1.13.0/prometheus/promhttp/instrument_server.go:142 +0xb8
net/http.HandlerFunc.ServeHTTP(0x0?, {0x22a1a20?, 0xc00123a7e0?}, 0x7fb47c5160f8?)
	net/http/server.go:2109 +0x2f
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerDuration.func2({0x22a1a20, 0xc00123a7e0}, 0xc001283000)
	github.com/prometheus/client_golang@v1.13.0/prometheus/promhttp/instrument_server.go:104 +0xbf
net/http.HandlerFunc.ServeHTTP(0x22a4058?, {0x22a1a20?, 0xc00123a7e0?}, 0x22884c0?)
	net/http/server.go:2109 +0x2f
github.com/fleetdm/fleet/v4/server/service.publicIP.func1({0x22a1a20, 0xc00123a7e0}, 0xc001282f00)
	github.com/fleetdm/fleet/v4/server/service/handler.go:152 +0x1ae
net/http.HandlerFunc.ServeHTTP(0xc001282e00?, {0x22a1a20?, 0xc00123a7e0?}, 0xc00175b958?)
	net/http/server.go:2109 +0x2f
github.com/gorilla/mux.(*Router).ServeHTTP(0xc000376b40, {0x22a1a20, 0xc00123a7e0}, 0xc001282c00)
	github.com/gorilla/mux@v1.8.0/mux.go:210 +0x1cf
net/http.(*ServeMux).ServeHTTP(0x0?, {0x22a1a20, 0xc00123a7e0}, 0xc001282c00)
	net/http/server.go:2487 +0x149
github.com/fleetdm/fleet/v4/server/launcher.(*Handler).Handler.func1({0x22a1a20, 0xc00123a7e0}, 0xc001282c00)
	github.com/fleetdm/fleet/v4/server/launcher/server.go:54 +0x1b9
net/http.HandlerFunc.ServeHTTP(0x0?, {0x22a1a20?, 0xc00123a7e0?}, 0x7204f4?)
	net/http/server.go:2109 +0x2f
net/http.serverHandler.ServeHTTP({0x229d340?}, {0x22a1a20, 0xc00123a7e0}, 0xc001282c00)
	net/http/server.go:2947 +0x30c
net/http.(*conn).serve(0xc00053e000, {0x22a4058, 0xc000cf4210})
	net/http/server.go:1991 +0x607
created by net/http.(*Server).Serve
	net/http/server.go:3102 +0x4db
```
k
Hi @kyle! How much memory do you have allocated for your Fleet instance?
z
Also, what version of Fleet are you running? Even your `t3.medium` should be enough for 300 hosts, I would think.
k
We're running the latest, 4.27.0. In my Helm chart I think at minimum it's getting 1 GB and can scale up to 8 GB.
I can probably change the minimum to something higher.
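(A quick way to confirm what the pod is actually getting, in case it helps anyone later; the label selector below is a guess based on a typical Fleet Helm install, so adjust it to match your release.)
```
# Show the requests/limits the Fleet container was scheduled with
kubectl get pods -l app=fleet -o jsonpath='{.items[0].spec.containers[0].resources}'; echo

# Show current CPU/memory usage (requires metrics-server)
kubectl top pods -l app=fleet
```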
k
Can you check the Fleet logs for anything related to vulnerabilities? Also, verify that the path set for the vulnerabilities database is present and that Fleet has access. I'll take that error to the team.
k
Yeah, I can take a look.
We had an issue with the vuln DB before because the containers were getting created with read-only filesystems.
@Kathy Satterlee I've taken a look at the logs I have and haven't seen anything related to vulnerabilities. I have debug logging on; anything else to check? It's definitely just that count API call that's causing issues:
`/api/latest/fleet/software/count?scope=softwareCount&vulnerable=false`
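(If it helps with reproduction: the same call can be made outside the browser to see whether it ever returns and how long it takes. The API token and hostname below are placeholders.)
```
# Time the software count endpoint directly, using your own Fleet URL and API token
curl -s -o /dev/null -w "HTTP %{http_code} in %{time_total}s\n" \
  -H "Authorization: Bearer $FLEET_API_TOKEN" \
  "https://fleet.example.com/api/latest/fleet/software/count?scope=softwareCount&vulnerable=false"
```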
k
Have you been able to check that the permissions are correct for the vulnerability database?
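(One quick check, assuming a Kubernetes deployment like the one described above; the deployment name and database path below are placeholders for whatever your chart and `databases_path` config use.)
```
# Confirm the vulnerability database path exists and is writable by the Fleet process
kubectl exec deployment/fleet -- ls -la /tmp/vuln
kubectl exec deployment/fleet -- touch /tmp/vuln/.write-test
```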
Hey @kyle, wanted to check in and see how things are going.
k
Appreciate the check-in @Kathy Satterlee! I made sure that our vulnerability databases are downloading (they're kept in /tmp/vuln IIRC), but we're still seeing that count API never load and maxing out the CPU of our MySQL server.
It hasn't been as high priority for me, so I haven't been actively troubleshooting.
k
Glad it's not critical for you! When you get the chance, can you try manually triggering the vulnerability scan using `fleetctl`?
```
fleetctl trigger --name vulnerabilities
```
Keep an eye on the logs and let me know what pops up.
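(For reference, tailing the server logs while the trigger runs is the easiest way to catch anything from the vuln processing; the deployment name below is a placeholder.)
```
# Follow the Fleet server logs and surface anything vulnerability- or cron-related
kubectl logs -f deployment/fleet | grep -iE 'vuln|cron|error'
```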
I suspect what you're seeing may be related to this issue. I saw a similar panic related to the cleanup cron not running properly, and it would also cause the software counts not to update (that's the last thing that happens in the vuln processing).
k
Ah, I see. I ran the command, and the first time I got
```
kyle@MacBook-Pro-3 ~ % fleetctl trigger --name vulnerabilities
[+] Sent request to trigger vulnerabilities schedule
```
but now see
```
kyle@MacBook-Pro-3 ~ % fleetctl trigger --name vulnerabilities
[!] Conflicts with current status of vulnerabilities schedule: triggered run started 8m6.413s ago
```
For the suggested fix, what do you mean by "unlock" and trigger the job that does the cleanups?
I took a look at the SQL query, and it looks like it wouldn't apply to anything in the current `cron_stats` table.
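(For anyone curious what to look at: the schedule state lives in the `cron_stats` table, so you can check it directly. Column names may differ slightly between Fleet versions; connection details below are placeholders.)
```
# Show the most recent cron runs and their status from the Fleet database
mysql -h <rds-endpoint> -u fleet -p fleet -e \
  "SELECT * FROM cron_stats ORDER BY id DESC LIMIT 10;"
```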
k
Try `fleetctl trigger --name cleanups_then_aggregation`
k
Looks like it invoked it and it completed.
The `vulnerabilities` job also completed, according to the `cron_stats` table.
k
Is there any chance you're missing a database migration?
k
It is very possible.
We use the Helm chart, and I notice the DB migrations container running, but I have skipped versions when upgrading.
But I guess the container should then see any that weren't applied and fix them?
k
Really all that matters is the last run of migrations.
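(In case it's useful to someone later: the same migrations the Helm migration job runs can also be applied by hand with the `fleet` binary pointed at the database; the MySQL settings below are Fleet's standard environment variables with placeholder values.)
```
# Apply any pending schema migrations against the Fleet database
FLEET_MYSQL_ADDRESS=<rds-endpoint>:3306 \
FLEET_MYSQL_DATABASE=fleet \
FLEET_MYSQL_USERNAME=fleet \
FLEET_MYSQL_PASSWORD=<password> \
fleet prepare db
```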
k
Following up on this for posterity if anyone runs across this again: I updated Fleet to v4.31.0 and all seems to be good now.