# fleet
a
Hey folks. One of our fleet servers is experiencing a storm of
authentication error: find host
errors, and all of the osquery agents are hitting connection timeouts. I looked through previous conversations about this error in this Slack but could not find anything relevant. The fleet server is on
4.4.3
. Any guidance possible?
Debug logs are showing all kinds of context cancelled events. An example:
level=debug ts=2022-01-07T16:08:23.36255733Z component=http method=POST uri=/api/v1/osquery/distributed/read took=15.943184963s ip_addr=<IP ADDR>:37838 x_for_ip_addr= err="retrieving policy queries: selecting policies for host: context canceled"
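For anyone who wants to triage these lines in bulk, here is a rough sketch of how the key=value debug output could be parsed to pick out slow requests that ended in `context canceled`. The helper names (`parse_logfmt`, `took_seconds`, `is_slow_cancel`) and the 10-second threshold are made up for illustration and are not part of Fleet. It also assumes `took` is printed in plain seconds, as in the example above.

```python
import re

# key=value pairs; the value is either a double-quoted string or a bare token
# (possibly empty, as in "x_for_ip_addr=").
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S*)')


def parse_logfmt(line):
    """Return a dict of the key=value fields found in one log line."""
    fields = {}
    for key, value in PAIR.findall(line):
        # Strip the surrounding quotes from quoted values.
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


def took_seconds(fields):
    """Convert a duration like '15.943184963s' to float seconds.

    Assumption: only the plain-seconds form is handled; other forms
    (e.g. '350ms' or '1m2s') return None.
    """
    match = re.fullmatch(r'(\d+(?:\.\d+)?)s', fields.get("took", ""))
    return float(match.group(1)) if match else None


def is_slow_cancel(fields, threshold=10.0):
    """True if the request was cancelled and took longer than threshold seconds."""
    seconds = took_seconds(fields)
    cancelled = "context canceled" in fields.get("err", "")
    return cancelled and seconds is not None and seconds > threshold
```

Grouping the matches by `uri` would show whether the cancellations cluster on one endpoint such as `/api/v1/osquery/distributed/read`.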
z
Possibly some communication issue with the MySQL server? You have any metrics on MySQL?
a
That's my suspicion as well. I checked the
slow_query
logs and there's nothing there. The general logs show a lot of queries from all fleet servers, including the one in question. Basically we have about 6K hosts enrolled across 12 fleet servers that all share 1 Redis and 1 MariaDB. The DB server is a virtualized server with 16 cores and 64 GB of RAM.
z
Sounds like it should be enough... How is the CPU utilization on the MySQL server?
Actually sounds like it should be way more than enough.
a
CPU utilization does not rise above 15%.
z
Can you upgrade to 4.8.0 before we debug further?
a
This fleet server has about 500 devices connected to it, and it's not the one with the most connections either. Config is all defaults in terms of MySQL connections.
We can definitely upgrade, although I would like to ask if there's anything specific in 4.8.0 that could help with this? We've had some trouble with new functionality being enabled by default in upgrades, so we need to read through the release notes to avoid incidents.
z
Each of our recent releases has had focus on performance and reliability. We're also trying not to do too much debugging on older releases as we often find that the root cause has been resolved by a newer release.
a
That makes total sense. Do you recommend the same practice to end users as well?
z
If you're stable and not eager for any new features, by all means stick with what's working well for you. When you run into trouble, we definitely recommend upgrading as a troubleshooting step.
a
OK, I'll update here after upgrading. Thank you!
So I kept troubleshooting this issue for quite some time. Found a lot of locks happening in the db on the
hosts
table. For anyone facing similar issues: upgrading from Fleet 4.8 to Fleet 4.9.1 fixed the locks and the related errors on its own for us. Thanks @zwass
z
Great to hear this! 4.9 involved a lot of performance optimization, particularly around the database.