# fleet
Hello fleet community, I upgraded fleet 3 weeks ago (from version 4.6.1 to 4.20.1) and have been experiencing a lot of OOMs since. I will attach some stderr logs in the thread. I tried adding memory and bumping to 4.22.0, but I am still facing the issue. Is anyone seeing the same behavior? A few things about the config:
• 30 fleet instances with 8GB RAM each, running behind an LB (haproxy)
• DB is MariaDB (I know it is not officially supported)
• Managing ~40k servers (mostly Linux with a few Windows) running osquery versions from 4.6 to the latest
• Software Inventory is disabled
Stderr logs
{"component":"http","err":"write tcp ip:port->ip:port: i/o timeout","level":"info","path":"/api/v1/osquery/config","ts":"2022-10-25T09:31:14.587630969Z"}
2022/10/25 09:31:14 http: superfluous response.WriteHeader call from github.com/prometheus/client_golang/prometheus/promhttp.(*responseWriterDelegator).WriteHeader (delegator.go:65)
Erratum - Initial version was 4.2.3 (no issue on this one) and not 4.6.1
Hey there! Thanks for all the details. Coincidentally, we load tested 4.22.0 with ~40k hosts and didn't notice this. Could you share a couple more details with us? 1. We have this guide on debugging; could you provide as many of the details listed there as you can? 2. Are there any other logs around the one you posted? or it's mainly
the problem? 3. You mentioned Software inventory is disabled, however it might be enabled per team. Do you have teams set up?
Hello Roberto, thank you for your reply. 1. I will gather and provide as much data as I can. 2. Mostly the provided logs, with an occasional "error in query ingestion" one. I did some searching on the log line about the prometheus client, and it is an issue that was fixed in a later version of the prometheus Go client. Any thoughts about bumping it? 3. It is explicitly disabled in the config (pushed via fleetctl apply with a YAML file) and we don't use teams. If that helps, the UI says it is disabled on the software tab. Additional info: we are also running fleet in a preprod environment with a few hosts, and it is running without issue.
Thank you! 1. sounds good, thanks again 🙂 2. interesting, I will create an issue to bump prometheus to avoid filling up the logs with that, but as far as I can tell that shouldn't be the source of your problems. Are you able to share more logs? It'd be interesting to see what other stuff is happening around the errors you see. Even a couple of hours of logging should be enough 3. understood, thanks!
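If it helps when gathering a couple of hours of logs, here is a minimal Python sketch for tallying stderr lines so the dominant errors stand out. It assumes the log mixes JSON lines with `path` and `err` fields (as in the sample above) and plain-text lines from Go's standard logger. The `summarize` helper and the choice to group by the trailing part of `err` are illustrative assumptions, not anything fleet provides.

```python
import json
from collections import Counter


def summarize(lines):
    """Count log lines by (path, short error).

    JSON lines are keyed by their "path" field and the text after the last
    ": " in "err" (so differing ip:port pairs collapse into one bucket).
    Lines that are not JSON objects are counted under ("<plain>", "").
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            entry = None
        if not isinstance(entry, dict):
            counts[("<plain>", "")] += 1
            continue
        err = str(entry.get("err", ""))
        short_err = err.rsplit(": ", 1)[-1]
        counts[(entry.get("path", ""), short_err)] += 1
    return counts
```

Passing an open log file to `summarize` and printing `most_common(10)` on the result would list the most frequent (path, error) pairs.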
@Clément Bouchard another question: are you using packs? or additional queries? do you happen to have a lot of packs or packs with a lot of queries?
4-5 packs and about 20 custom queries. I tried removing all of them and installing on a fresh DB: same issue.
Well, I wanted to make sure the database wasn't the issue and was about to migrate to one I manage. Before doing that, I removed the DB replica configuration, and that seems to have solved the issue. See the memory graph below
The weird thing is that I enabled the DB replica a week ago in order to improve performance and solve the OOM issue. In the meantime, I upgraded from 4.20.1 to 4.22.0.
Wow, interesting. Do you mind if I create an issue with this information (possibly including the screenshot)? This is definitely something I'd like to investigate using MySQL instead of MariaDB.
Sure, let me know if I can help you reproduce the issue.