Hey folks One of our fleet servers is experiencing a storm o osquery #fleet

Ayan

01/07/2022, 3:39 PM

Hey folks. One of our fleet servers is experiencing a storm of "authentication error: find host" errors, and all of the osquery agents are getting connection timeouts. I looked through previous conversations about this error in this Slack but could not find anything relevant. The fleet server is on 4.4.3. Any guidance possible?

Ayan

01/07/2022, 4:09 PM

Debug logs are showing all kinds of context cancelled events. An example:

level=debug ts=2022-01-07T16:08:23.36255733Z component=http method=POST uri=/api/v1/osquery/distributed/read took=15.943184963s ip_addr=<IP ADDR>:37838 x_for_ip_addr= err="retrieving policy queries: selecting policies for host: context canceled"
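
A minimal sketch for triaging lines like the one above, assuming the debug log is in logfmt (space-separated key=value pairs, quoted values may contain spaces). It counts "context canceled" errors per endpoint and tracks the slowest request. The function names are hypothetical, not part of Fleet, and only simple single-unit durations (s, ms, µs, ns) are handled.

```python
import shlex
from collections import Counter

# The log line quoted in the message above, unchanged.
SAMPLE = (
    'level=debug ts=2022-01-07T16:08:23.36255733Z component=http method=POST '
    'uri=/api/v1/osquery/distributed/read took=15.943184963s '
    'ip_addr=<IP ADDR>:37838 x_for_ip_addr= '
    'err="retrieving policy queries: selecting policies for host: context canceled"'
)


def parse_logfmt(line: str) -> dict[str, str]:
    """Split a logfmt line into key/value pairs; tokens without '=' are skipped."""
    fields = {}
    for token in shlex.split(line):  # shlex keeps quoted values together
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def duration_seconds(took: str) -> float:
    """Convert a single-unit duration such as '15.9s' or '250ms' to seconds."""
    # Check the longer suffixes first, since "ms" also ends with "s".
    for suffix, scale in (("ms", 1e-3), ("µs", 1e-6), ("ns", 1e-9), ("s", 1.0)):
        if took.endswith(suffix):
            return float(took[: -len(suffix)]) * scale
    raise ValueError(f"unrecognized duration: {took!r}")


def summarize(lines):
    """Count 'context canceled' errors per URI and return the slowest such request."""
    canceled = Counter()
    slowest = 0.0
    for line in lines:
        fields = parse_logfmt(line)
        if "context canceled" in fields.get("err", ""):
            canceled[fields.get("uri", "?")] += 1
            slowest = max(slowest, duration_seconds(fields.get("took", "0s")))
    return canceled, slowest


if __name__ == "__main__":
    counts, slowest = summarize([SAMPLE])
    print(counts.most_common(), f"slowest={slowest:.1f}s")
```

Feeding it the whole log file (one line per entry) would show whether the cancellations cluster on a single endpoint. A request that takes about 16 seconds before being canceled suggests the caller gave up waiting, which fits the timeouts the agents report.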

👀 1

zwass

01/07/2022, 4:10 PM

Possibly some communication issue with the MySQL server? You have any metrics on MySQL?

Ayan

01/07/2022, 4:14 PM

That's my suspicion as well. I checked the slow_query logs and there's nothing there. The general logs show a lot of queries from all fleet servers, including the one in question. Basically, we have about 6K hosts enrolled across 12 fleet servers that all share 1 Redis and 1 MariaDB. The DB server is a 16-core, 64 GB virtualized server.

zwass

01/07/2022, 4:16 PM

Sounds like it should be enough... How is the CPU utilization on the MySQL server?

zwass

01/07/2022, 4:16 PM

Actually sounds like it should be way more than enough.

Ayan

01/07/2022, 4:18 PM

CPU utilization does not rise beyond 15%.

zwass

01/07/2022, 4:18 PM

Can you upgrade to 4.8.0 before we debug further?

Ayan

01/07/2022, 4:19 PM

This fleet server has about 500 devices connected to it, and it's not the one with the most connections either. Config is all defaults in terms of MySQL connections.
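
With 12 fleet servers sharing one database on default connection settings, a quick check is whether the pools could, in the worst case, ask for more connections than the database accepts. The sketch below is only arithmetic. The server count comes from the thread. The per-server pool size (50) and the database limit (151) are assumptions about stock defaults, not values confirmed here, so verify them against the actual Fleet config and the output of SHOW VARIABLES LIKE 'max_connections'.

```python
FLEET_SERVERS = 12               # stated in the thread
MAX_OPEN_CONNS_PER_SERVER = 50   # ASSUMPTION: default Fleet MySQL pool size; verify
DB_MAX_CONNECTIONS = 151         # ASSUMPTION: stock MariaDB max_connections; verify


def worst_case_connections(servers: int, per_server: int) -> int:
    """Upper bound on simultaneous connections if every pool is fully used."""
    return servers * per_server


worst = worst_case_connections(FLEET_SERVERS, MAX_OPEN_CONNS_PER_SERVER)
headroom = DB_MAX_CONNECTIONS - worst

if __name__ == "__main__":
    print(f"worst case: {worst} connections, database limit: {DB_MAX_CONNECTIONS}")
    if headroom < 0:
        print(f"pools could exceed the limit by {-headroom} connections")
```

A negative headroom does not prove that connections were being refused in this incident. It only shows that low CPU on the database does not rule out connection-level contention.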

Ayan

01/07/2022, 4:22 PM

We can definitely upgrade although I would like to ask if there's anything specific in 4.8.0 that could help this? We've had some trouble with new functionalities enabled by default in the upgrades so we need to read through the release notes to avoid incidents.

zwass

01/07/2022, 4:26 PM

Each of our recent releases has had focus on performance and reliability. We're also trying not to do too much debugging on older releases as we often find that the root cause has been resolved by a newer release.

Ayan

01/07/2022, 4:27 PM

That makes total sense. Do you recommend end users the same practice as well?

zwass

01/07/2022, 4:30 PM

If you're stable and not eager for any new features, by all means stick with what's working well for you. When you run into trouble, we definitely recommend upgrading as a troubleshooting step.

Ayan

01/07/2022, 4:56 PM

OK, I'll update here after upgrading. Thank you!

ty 1

Ayan

02/11/2022, 6:37 PM

So I kept troubleshooting this issue for quite some time and found a lot of locks happening in the DB on the hosts table. For anyone facing similar issues: an upgrade from Fleet 4.8 to Fleet 4.9.1 fixed the locks and the related errors for us. Thanks @zwass
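
For anyone wanting to confirm a similar lock pile-up, here is a minimal sketch of one way to look for it. It assumes the InnoDB information_schema tables INNODB_LOCK_WAITS and INNODB_LOCKS are available (MariaDB, and MySQL up to 5.7; MySQL 8.0 moved this data to performance_schema). The query is not run here. The rows in the example are made up for illustration, not taken from this incident, and the helper name is hypothetical.

```python
from collections import Counter

# Run this with any SQL client; each result row is (lock_table, waiting_trx, blocking_trx).
LOCK_WAITS_SQL = """
SELECT l.lock_table, w.requesting_trx_id, w.blocking_trx_id
FROM information_schema.INNODB_LOCK_WAITS AS w
JOIN information_schema.INNODB_LOCKS AS l
  ON l.lock_id = w.requested_lock_id
"""


def waits_by_table(rows):
    """Count how many transactions are waiting on a lock, grouped by table."""
    return Counter(lock_table for lock_table, _waiting, _blocking in rows)


# HYPOTHETICAL rows, shaped like the query result above.
example_rows = [
    ("`fleet`.`hosts`", "1001", "1000"),
    ("`fleet`.`hosts`", "1002", "1000"),
    ("`fleet`.`policy_membership`", "1003", "1002"),
]

if __name__ == "__main__":
    for table, waiting in waits_by_table(example_rows).most_common():
        print(f"{table}: {waiting} waiting transaction(s)")
```

Sampling the query a few times during an error storm and seeing one table dominate the counts would point to the kind of contention described above.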

ty 2

🙌 3

zwass

02/11/2022, 7:57 PM

Great to hear this! 4.9 involved a lot of performance optimization, particularly around the database.
