# fleet
s
After a network hiccup we are seeing thousands of these errors:
fleet[7390]: {"component":"http","err":"authentication error: find host","level":"info","path":"/api/v1/osquery/log"
We are running 4.4.0. We had policies running but removed them when the DB became unresponsive, yet we still get these errors. What can we do to solve this?
These are the errors we are getting at this moment the most:
"authentication error: find host"	
"updating hosts label updated at: context canceled"	
"create transaction: context canceled"
"delete label query executions: context canceled"
Is there any way to reset it?
t
it should reset on its own. Depending on your setup, there might be a lot of threads trying to do things in the database as hosts check in, and the hiccup made everything fail. If you want it to stop, you might need to stop fleet serve for a bit, make sure your DB is stable, and then restart it. However, that only keeps the logs clean; you don't really need to do it. As connections resume, if the DB is reachable with the same parameters, it'll start working again
I suggest you update to 4.4.2, as the policies issue was resolved
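(For reference, a quick way to gauge whether the DB has actually settled before restarting fleet serve could look roughly like this; these are standard MySQL status variables and the comments are only rough guidance:)
SHOW GLOBAL STATUS LIKE 'Uptime';            -- a large value confirms the MySQL server itself did not restart during the hiccup
SHOW GLOBAL STATUS LIKE 'Threads_running';   -- should fall back to a small number once the backlog clears
SHOW GLOBAL STATUS LIKE 'Threads_connected'; -- open client connections, including Fleet's pool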
s
Thanks for the feedback @Tomas Touceda. Still trying to get it stable; if that does not work I will upgrade to 4.4.2
Looking at the MySQL logs I see lots of the following errors; only running 1 Fleet server at this moment:
Aborted connection 12345 to db: 'db' user: 'user' host: 'ip' (Got an error reading communication packets)
Aborted connection 12345 to db: 'db' user: 'user' host: 'ip' (Got an error writing communication packets)
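(Those "Aborted connection ... communication packets" messages usually mean clients dropped mid-conversation, e.g. timeouts, network resets, or oversized packets; a few standard counters and variables to check, as a sketch:)
SHOW GLOBAL STATUS LIKE 'Aborted_clients';   -- connections that were dropped without a clean close
SHOW GLOBAL STATUS LIKE 'Aborted_connects';  -- failed connection attempts
SHOW VARIABLES LIKE 'wait_timeout';          -- idle connections are cut after this many seconds
SHOW VARIABLES LIKE 'max_allowed_packet';    -- packets larger than this also abort the connection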
The upgrade finished, but I'm not seeing any improvement yet. Any other tips?
t
could you tell me a bit more about your infrastructure? database size, type of deployment, fleet instance count/size, etc. I feel like I already asked you this question, so apologies for the repeat, but it's hard to keep track.
s
Hey @Tomas Touceda No problem. We have about 10k hosts connected, 3 Fleet instances talking to 1 DB; the DB is 16 cores and 64 GB. Until the hiccup everything was working fine; after that it's constant 100% CPU usage and an unusable DB. I tried different things, including your tip regarding
delete from policies
to see if that helped, but nothing changed. To me it seems there is some issue with the policy data that osquery keeps sending to the DB.
t
100% CPU usage in the db, correct? do you have a list of the top queries?
s
Yes, on the DB, correct. What is the best way to get that list? Btw, on the Fleet side I keep getting bombarded with
authentication error: find host
t
do you have slow log enabled?
s
slow_query_log
is disabled
t
enabling that could be good,
SHOW FULL PROCESSLIST;
would list what's running, but there are other performance tools that might be easier depending on your setup. If you have Prometheus, seeing ops and times could be useful
other than that, if you could share a big chunk of the fleet serve log with debug logging enabled, that might help shed some light as to what's happening
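(For reference, enabling the slow query log at runtime could look roughly like this, assuming you have privileges to set global variables; the file path is only a placeholder and the one-second threshold is just a starting point:)
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL slow_query_log_file = '/var/lib/mysql/slow.log';  -- placeholder path, use whatever suits your setup
SET GLOBAL long_query_time = 1;                              -- log statements slower than 1 second
SHOW VARIABLES LIKE 'slow_query%';                           -- confirm the settings took effect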
s
I made some changes per your tips: enabled the slow query log to see if it logs anything, and I lowered
max_open_conns
and
max_idle_conns
(they were 1000 and 200, now 100 and 20). The DB seems more stable now; I don't know if it will cause issues anywhere else.
Are there any recommended MySQL settings for a large deployment?
slow_query_log
is giving me this over and over
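(One way to sanity-check the pool change from the MySQL side, as a sketch; keep in mind the effective pool is max_open_conns multiplied by the number of Fleet instances:)
SHOW VARIABLES LIKE 'max_connections';          -- server-side cap on connections
SHOW GLOBAL STATUS LIKE 'Threads_connected';    -- currently open connections across all Fleet instances
SHOW GLOBAL STATUS LIKE 'Max_used_connections'; -- high-water mark since the last restart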
t
could you tell me the output of the following query:
SELECT TABLE_ROWS, TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'fleet'
?
s
t
hm, what does the following query give you:
explain SELECT DISTINCT s.id, scv.cve
		FROM host_software hs
		JOIN hosts h ON (hs.host_id=h.id)
		JOIN software s
		JOIN software_cpe scp ON (s.id=scp.software_id)
		JOIN software_cve scv ON (scp.id=scv.cpe_id)
		WHERE hs.host_id=1
?
s
Seems as if something ain't right...
t
probably the host_id, grab a random id from the hosts table, and plug it at the end instead of the 1
s
is that the
osquery_host_id
from hosts or
uuid
?
My bad, I see it is id when I checked the query.
t
an example could be what comes from
select id from hosts limit 1
s
Censored a bit, not sure what I'm looking at 😄
t
could you tell me the fleet serve config you're using? (note that this is not
fleetctl get config
, but the config you set for Fleet to start)
the EXPLAIN basically tells me that all indexes are in place and the query is meant to be fast, so what might be happening is locking
also, what size are the fleet instances?
s
size as in CPU?
t
CPU+RAM+disk space
s
CPU: 4 cores × 3 servers, RAM: 16 GB, HDD: 60 GB
t
you have a specific instance you set for vulnerability processing, correct?
s
yes correct
That one is not counted above; it is a 4th server that is not connected to any agents
t
gotcha, let's try disabling vulnerability processing in that instance, restart it, and see if that calms the db down
if it's locking that's the problem, you would be able to see that with something like innotop
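(If innotop isn't handy, a rough way to spot lock waits from a mysql shell, assuming MySQL 5.7+ with the sys schema installed:)
SHOW ENGINE INNODB STATUS\G   -- the TRANSACTIONS section lists transactions waiting on locks
SELECT waiting_pid, waiting_query, blocking_pid, blocking_query
FROM sys.innodb_lock_waits;   -- one row per blocked statement, with the query that is blocking it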
s
By disabling, you mean changing the
current_instance_checks:
setting in that one server?
t
yes
oh, but I think I see what might be causing the issue now, so silly to have missed this
s
disabled the setting
Did you figure out the cause?
t
to help confirm that this is the case, could you run the following:
SELECT DISTINCT s.id, scv.cve
		FROM host_software hs
		JOIN hosts h ON (hs.host_id=h.id)
		JOIN software s
		JOIN software_cpe scp ON (s.id=scp.software_id)
		JOIN software_cve scv ON (scp.id=scv.cpe_id)
		WHERE hs.host_id=<the id you used before>
and then compare the speed with the following:
SELECT DISTINCT s.id, scv.cve
		FROM host_software hs
		JOIN hosts h ON (hs.host_id=h.id)
		JOIN software s ON  (s.id=hs.software_id)
		JOIN software_cpe scp ON (s.id=scp.software_id)
		JOIN software_cve scv ON (scp.id=scv.cpe_id)
		WHERE hs.host_id=1
well, there was a missing condition in the join (the software table had no ON clause), so it's joining against the whole table rather than filtering it with the index; I'm guessing that's the issue
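(As an aside, not from the thread itself: in MySQL a JOIN written without an ON clause is treated as a CROSS JOIN, so the first query pairs every host_software row with every software row before DISTINCT trims the result, which is why it crawls at this scale. A minimal sketch of the difference:)
-- Cartesian product: every host_software row paired with every software row
SELECT COUNT(*) FROM host_software hs JOIN software s;
-- With the ON clause, each host_software row matches at most one software row
SELECT COUNT(*) FROM host_software hs JOIN software s ON s.id = hs.software_id;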
s
At the moment the lower one is giving me an
Empty set
based on the id, and the other one is hanging; waiting for it to return at the moment
t
oh, right, please change that 1 for the same id you used in the other
you can kill the other query, btw
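(Killing the hanging query could look like this; 12345 is a placeholder for the Id shown in the processlist:)
SHOW FULL PROCESSLIST;   -- find the Id of the long-running SELECT
KILL QUERY 12345;        -- stops that statement but keeps the client connection open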
s
Yes, I already did that, but it's still empty. The first query took: 2727 rows in set (1 min 59.21 sec)
But the other one is still empty with the correct id
t
how long did it take?
we'll be cutting a 4.4.3 version today with this fix, will keep you posted. Thank you for bearing with me through this debugging
s
But the second query did not give me any results and finished immediately, so I don't think that is correct either, right?
t
do you see any vulnerabilities reported in the host details page for that host?
s
Indeed no vulnerability data 😮
t
yup, so this is the issue for sure
s
So the other query was just returning all the vulnerabilities and not selecting based on id?
t
correct
we'll have a new version soon
s
Understandable then that the DB is dying on me with 10k+ hosts 😮
Is that query only running from the server with the vulnerability scanning enabled or running from all the servers?
t
running on all of them
s
That is why all my improvements had no effect; I understand it now
Gonna roll back my changes while waiting on the new release. Thanks for your help fixing this
Ping me when the new release is out so I can test it immediately
t
will do!
s
Hey @Tomas Touceda I can confirm the identified query is the issue. I disabled the vulnerability check for now while waiting on the patch release, and all systems are working fine again: no errors and no load issues anymore.
👍 1
Hey @Tomas Touceda Do you have an ETA on the release?
t
ETA is today, so unless we see anything surprising, it'll land in a few hours
👍 1
cc @Gavin
4.4.3 is available
🎉 1
g
hurrah
f
I'm still seeing this issue (or something similar) on Kubernetes: component=http path=/api/v1/osquery/distributed/write err="authentication error: find host .... context canceled". Fleet webserver: docker.io/fleetdm/fleet:v4.6.1, Fleet MySQL database: docker.io/mysql:8.0.27, and Fleet Redis cache: docker.io/bitnami/redis:6.2.6-debian-10-r53. There are no k8s pod limits imposed, so resources are not an issue; the cluster is pretty beefy.
y
@Flngen Flugen did you resolve this issue?