#fleet
Clément Bouchard

10/25/2022, 9:47 AM
Hello Fleet community, I upgraded Fleet 3 weeks ago (from version 4.6.1 to 4.20.1) and have been experiencing a lot of OOMs since. I will attach some stderr logs in the thread. I tried adding memory and bumping to 4.22.0, and I am still facing the issue. Is anyone seeing the same behavior? A few things about the config:
• 30 Fleet instances with 8 GB RAM each, running behind a load balancer (HAProxy)
• DB is MariaDB (I know it is not officially supported)
• Managing ~40k servers (mostly Linux, with a few Windows) using osquery from 4.6 to the latest version
• Software inventory is disabled
9:51 AM
Stderr logs
{"component":"http","err":"write tcp ip:port->ip:port: i/o timeout","level":"info","path":"/api/v1/osquery/config","ts":"2022-10-25T09:31:14.587630969Z"}
2022/10/25 09:31:14 http: superfluous response.WriteHeader call from github.com/prometheus/client_golang/prometheus/promhttp.(*responseWriterDelegator).WriteHeader (delegator.go:65)
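To see which endpoints and errors dominate a longer capture of these logs, a small script can tally the JSON lines. This is a minimal sketch, assuming each structured line is a JSON object with the "path" and "err" keys seen in the excerpt above. The function name summarize is hypothetical and not part of Fleet. Plain-text lines such as the "superfluous response.WriteHeader" message are counted separately rather than parsed.

```python
import json
from collections import Counter


def summarize(lines):
    """Count JSON log records by (path, err); count non-JSON lines separately.

    Assumes the record shape shown in the excerpt above ("path" and "err"
    keys). Missing keys are reported as "-".
    """
    counts = Counter()
    skipped = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1  # plain-text line, e.g. Go's standard logger output
            continue
        if not isinstance(record, dict):
            skipped += 1  # valid JSON, but not a log record
            continue
        counts[(record.get("path", "-"), record.get("err", "-"))] += 1
    return counts, skipped
```

The function accepts any iterable of lines, so an open log file can be passed to it directly, and counts.most_common(10) then lists the most frequent path and error pairs.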
9:54 AM
Erratum: the initial version was 4.2.3 (no issues on that one), not 4.6.1.
roberto

10/25/2022, 1:33 PM
hey there! thanks for all the details. We did coincidentally load test 4.22.0 with ~40k hosts and didn't notice this. Could you share a couple more details with us?
1. We have this guide on debugging; could you provide as many of the details listed there as you can?
2. Are there any other logs around the ones you posted, or is /api/v1/osquery/config the main problem?
3. You mentioned software inventory is disabled; however, it might be enabled per team. Do you have teams set up?
Clément Bouchard

10/25/2022, 2:40 PM
Hello Roberto, thank you for your reply.
1. I will gather and provide as much data as I can.
2. Mostly the logs I provided, sometimes with an "error in query ingestion" one. I did some searching on the log line about the Prometheus client, and it is an issue that was fixed in a later version of the Prometheus Go client. Any thoughts about bumping it?
3. It is explicitly disabled in the config (pushed via a fleetctl apply YAML file) and we don't use teams. If that helps, the UI mentions on the Software tab that it is disabled.
Additional info: we are also running Fleet in a preprod environment with a few hosts, and it is running without issue.
roberto

10/25/2022, 2:50 PM
Thank you! 1. sounds good, thanks again 🙂 2. interesting, I will create an issue to bump prometheus to avoid filling up the logs with that, but as far as I can tell that shouldn't be the source of your problems. Are you able to share more logs? It'd be interesting to see what other stuff is happening around the errors you see. Even a couple of hours of logging should be enough 3. understood, thanks!
3:55 PM
@Clément Bouchard another question: are you using packs? or additional queries? do you happen to have a lot of packs or packs with a lot of queries?
Clément Bouchard

10/25/2022, 5:11 PM
4-5 packs and about 20 custom queries. I tried removing all of them and installing on a fresh DB: same issue.
6:32 AM
Well, I wanted to make sure the database wasn't the issue and was about to migrate to one I manage. Before doing that, I removed the DB replica configuration, and that seems to have solved the issue. See the memory graph below
6:37 AM
The weird thing is that I enabled the DB replica a week ago in order to improve performance and solve the OOM issue. In the meantime, I also upgraded from 4.20.1 to 4.22.0.
roberto

10/26/2022, 7:09 PM
wow, interesting. Do you mind if I create an issue with this information (possibly including the screenshot)? This is definitely something I'd like to investigate using MySQL instead of MariaDB
Clément Bouchard

10/27/2022, 6:53 AM
Sure, let me know if I can help you reproduce the issue.