#fleet
Ayan

01/07/2022, 3:39 PM
Hey folks. One of our Fleet servers is experiencing a storm of "authentication error: find host" errors, and all of the osquery agents are hitting connection timeouts. I looked through the previous conversations about this error in this Slack but could not find anything relevant. The Fleet server is on 4.4.3. Any guidance possible?
4:09 PM
Debug logs are showing all kinds of "context canceled" events. An example:
level=debug ts=2022-01-07T16:08:23.36255733Z component=http method=POST uri=/api/v1/osquery/distributed/read took=15.943184963s ip_addr=<IP ADDR>:37838 x_for_ip_addr= err="retrieving policy queries: selecting policies for host: context canceled"
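For anyone triaging similar logs, here is a minimal Python sketch that counts how many requests failed with "context canceled", grouped by URI. It assumes every log line uses the same key=value layout as the example above. The helper names `parse_logfmt` and `canceled_by_uri` are made up for illustration and are not part of Fleet.

```python
import re
from collections import Counter

# key=value pairs; a value is either double-quoted or a run of non-space characters.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S*)')


def parse_logfmt(line):
    """Parse one key=value log line into a dict, stripping surrounding quotes."""
    fields = {}
    for key, value in PAIR.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


def canceled_by_uri(lines):
    """Count lines whose err field mentions 'context canceled', grouped by uri."""
    counts = Counter()
    for line in lines:
        fields = parse_logfmt(line)
        if "context canceled" in fields.get("err", ""):
            counts[fields.get("uri", "?")] += 1
    return counts


# The example line from this thread (IP address redacted as in the original).
sample = (
    "level=debug ts=2022-01-07T16:08:23.36255733Z component=http method=POST "
    "uri=/api/v1/osquery/distributed/read took=15.943184963s "
    "ip_addr=<IP ADDR>:37838 x_for_ip_addr= "
    'err="retrieving policy queries: selecting policies for host: context canceled"'
)

print(canceled_by_uri([sample]))
```

Feeding it the full debug log (one line per list element) should show whether the cancellations cluster on one endpoint such as distributed/read or are spread across all of them.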
zwass

01/07/2022, 4:10 PM
Possibly some communication issue with the MySQL server? You have any metrics on MySQL?
Ayan

01/07/2022, 4:14 PM
That's my suspicion as well. I checked the slow_query logs and there's nothing there. The general logs show a lot of queries from all Fleet servers, including the one in question. Basically, we have about 6K hosts enrolled across 12 Fleet servers that all share 1 Redis and 1 MariaDB. The DB server is a 16-core, 64 GB virtualized server.
zwass

01/07/2022, 4:16 PM
Sounds like it should be enough... How is the CPU utilization on the MySQL server?
4:16 PM
Actually sounds like it should be way more than enough.
Ayan

01/07/2022, 4:18 PM
CPU utilization does not rise above 15%.
zwass

01/07/2022, 4:18 PM
Can you upgrade to 4.8.0 before we debug further?
Ayan

01/07/2022, 4:19 PM
This Fleet server has about 500 devices connected to it, and it's not the one with the most connections either. Config is all defaults in terms of MySQL connections.
4:22 PM
We can definitely upgrade, although I would like to ask whether there's anything specific in 4.8.0 that could help with this. We've had some trouble with new functionality enabled by default in upgrades, so we need to read through the release notes to avoid incidents.
zwass

01/07/2022, 4:26 PM
Each of our recent releases has focused on performance and reliability. We're also trying not to do too much debugging on older releases, as we often find that the root cause has been resolved by a newer release.
Ayan

01/07/2022, 4:27 PM
That makes total sense. Do you recommend the same practice to end users as well?
zwass

01/07/2022, 4:30 PM
If you're stable and not eager for any new features, by all means stick with what's working well for you. When you run into trouble, we definitely recommend upgrading as a troubleshooting step.
Ayan

01/07/2022, 4:56 PM
OK, I'll update here after upgrading. Thank you!
6:37 PM
So I kept troubleshooting this issue for quite some time and found a lot of locks happening in the DB on the hosts table. For anyone facing similar issues: an upgrade from Fleet 4.8 to Fleet 4.9.1 fixed the locks and the related errors for us. Thanks @zwass
zwass

02/11/2022, 7:57 PM
Great to hear this! 4.9 involved a lot of performance optimization, particularly around the database.