#fleet
SK

10/19/2021, 8:03 AM
After a network hiccup we are seeing thousands of these errors:
fleet[7390]: {"component":"http","err":"authentication error: find host","level":"info","path":"/api/v1/osquery/log"
We are running 4.4.0. We had policies running but removed them when the DB became unresponsive, and now we are seeing these errors. What can we do to solve this?
8:14 AM
These are the errors we are getting most often at the moment:
"authentication error: find host"
"updating hosts label updated at: context canceled"
"create transaction: context canceled"
"delete label query executions: context canceled"
Is there any way to reset it?
Tomas Touceda

10/19/2021, 12:59 PM
it should reset on its own. Depending on your setup, there might be a lot of threads trying to do things in the database as hosts check in, and the hiccup made everything fail. If you want it to stop, you might need to stop fleet serve for a bit, make sure your db is stable, and then restart it. However, this is just so the logs are clean, you don't really need to do this. As connections resume, if the db is accessible through the same parameters, it'll start working
1:22 PM
I suggest you update to 4.4.2, as the policies issue was resolved
SK

10/20/2021, 7:58 AM
Thanks for the feedback @Tomas Touceda. Still trying to get it stable; if it does not work, I will upgrade to 4.4.2.
8:52 AM
Looking at the MySQL logs I see lots of the following errors (only running 1 fleet server at this moment):
Aborted connection 12345 to db: 'db' user: 'user' host: 'ip' (Got an error reading communication packets)
Aborted connection 12345 to db: 'db' user: 'user' host: 'ip' (Got an error writing communication packets)
11:24 AM
Upgrade finished, but not seeing any improvements yet. Any other tips?
Tomas Touceda

10/20/2021, 2:04 PM
could you tell me a bit more about your infrastructure? database size, type of deployment, fleet instance count/size, etc. I feel like I already asked you this question, so apologies for the repeat, but it's hard to keep track.
SK

10/20/2021, 2:30 PM
Hey @Tomas Touceda No problem. We have about 10k hosts connected: 3 fleet instances talking to 1 DB, and the DB has 16 cores and 64 GB. Until the hiccup everything was working fine; since then the DB has been at 100% CPU usage all the time and unusable. I tried different things, including your tip regarding
delete from policies
to see if that helped, but nothing changed. To me it seems there is some issue with the policy data that osquery wants to keep sending to the DB.
Tomas Touceda

10/20/2021, 2:35 PM
100% CPU usage in the db, correct? do you have a list of the top queries?
SK

10/20/2021, 2:36 PM
Yes, on the DB, correct. What is the best way to get that list? Btw, on the fleet side I keep getting bombarded with
authentication error: find host
Tomas Touceda

10/20/2021, 2:38 PM
do you have slow log enabled?
SK

10/20/2021, 2:40 PM
slow_query_log
is disabled
Tomas Touceda

10/20/2021, 2:41 PM
enabling that could be good,
SHOW FULL PROCESSLIST;
would list what's running, but there are other performance tools that might be easier depending on your setup. If you have prometheus, seeing ops and times could be useful
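One way to turn the SHOW FULL PROCESSLIST output into a "top queries" list is sketched below. This is only an illustration: it assumes the rows have already been fetched as dicts keyed by the processlist column names (Command, Time, Info), and the helper name top_queries and the sample rows are made up for the example.

```python
from collections import defaultdict


def top_queries(rows, limit=5):
    """Group processlist rows by statement text, ranked by total running time."""
    totals = defaultdict(lambda: {"count": 0, "total_time": 0})
    for row in rows:
        info = row.get("Info")
        # Skip sleeping or idle connections: they carry no statement text.
        if row.get("Command") != "Query" or not info:
            continue
        key = " ".join(info.split())  # collapse whitespace so identical statements group together
        totals[key]["count"] += 1
        totals[key]["total_time"] += row.get("Time") or 0
    ranked = sorted(totals.items(), key=lambda kv: kv[1]["total_time"], reverse=True)
    return [(query, stats["count"], stats["total_time"]) for query, stats in ranked[:limit]]


# Hypothetical rows, invented for illustration only.
sample_rows = [
    {"Id": 1, "Command": "Sleep", "Time": 120, "Info": None},
    {"Id": 2, "Command": "Query", "Time": 30, "Info": "SELECT  1 FROM a"},
    {"Id": 3, "Command": "Query", "Time": 45, "Info": "SELECT 1 FROM a"},
    {"Id": 4, "Command": "Query", "Time": 10, "Info": "SELECT 2 FROM b"},
]

summary = top_queries(sample_rows)
for query, count, total_time in summary:
    print(f"{total_time:>5}s  x{count}  {query}")
```

A statement that shows up many times with a large combined Time is a reasonable first suspect when the DB sits at 100% CPU.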
2:41 PM
other than that, if you could share a big chunk of the fleet serve log with debug logging enabled, that might help shed some light as to what's happening
SK

10/20/2021, 2:48 PM
I made some changes per your tips: I enabled the slow query log to see if it logs anything, and I lowered
max_open_conns
and
max_idle_conns
from 1000 and 200 to 100 and 20. The DB seems more stable now, though I don't know if it will cause issues anywhere else.
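A rough way to see why lowering the pool size can calm the DB: the worst-case number of connections is the pool size multiplied by the number of fleet instances. The sketch below assumes each fleet process keeps its own pool capped at max_open_conns (an assumption, not confirmed in this thread); the instance count of 3 comes from the thread, and the helper name pool_ceiling is hypothetical.

```python
def pool_ceiling(instances: int, max_open_conns: int) -> int:
    """Upper bound on DB connections if every instance fills its own pool."""
    return instances * max_open_conns


FLEET_INSTANCES = 3  # from the thread: 3 fleet instances serving agents

before = pool_ceiling(FLEET_INSTANCES, 1000)  # old setting
after = pool_ceiling(FLEET_INSTANCES, 100)    # lowered setting

print(f"worst case before: {before} connections, after: {after} connections")
```

Whatever the ceiling works out to, it is worth comparing against the max_connections value configured on the MySQL server.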
2:56 PM
Are there any recommended MySQL settings for a large deployment?
3:01 PM
slow_query_log
is giving me this over and over
Tomas Touceda

10/20/2021, 3:05 PM
could you tell me the output of the following query:
SELECT TABLE_ROWS, TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'fleet'
?
SK

10/20/2021, 3:07 PM
Tomas Touceda

10/20/2021, 3:11 PM
hm, what does the following query give you:
explain SELECT DISTINCT s.id, scv.cve
		FROM host_software hs
		JOIN hosts h ON (hs.host_id=h.id)
		JOIN software s
		JOIN software_cpe scp ON (s.id=scp.software_id)
		JOIN software_cve scv ON (scp.id=scv.cpe_id)
		WHERE hs.host_id=1
?
SK

10/20/2021, 3:19 PM
Seems as if something ain't right...
Tomas Touceda

10/20/2021, 3:21 PM
probably the host_id, grab a random id from the hosts table, and plug it at the end instead of the 1
SK

10/20/2021, 3:26 PM
is that the
osquery_host_id
from hosts or
uuid
?
3:27 PM
My bad, I see it is id now that I've checked the query.
Tomas Touceda

10/20/2021, 3:29 PM
an example could be what comes from
select id from hosts limit 1
SK

10/20/2021, 3:31 PM
censored a bit, not sure what i'm looking at 😄
Tomas Touceda

10/20/2021, 3:33 PM
could you tell me the fleet serve config you're using? (note that this is not
fleetctl get config
, but the config you set for fleet to start)
3:33 PM
the explain tells me basically that all indexes are in place, and the query is meant to be fast, so what might be happening is locking
3:33 PM
also, what size are the fleet instances?
SK

10/20/2021, 3:36 PM
Size as in CPU?
Tomas Touceda

10/20/2021, 3:40 PM
CPU+RAM+disk space
SK

10/20/2021, 3:41 PM
CPU: 4 cores * 3 servers RAM: 16GB HDD: 60GB
Tomas Touceda

10/20/2021, 3:41 PM
you have a specific instance you set for vulnerability processing, correct?
SK

10/20/2021, 3:41 PM
yes correct
3:42 PM
That one is not counted in the numbers above; it is a 4th server that does not have any agents connecting to it.
Tomas Touceda

10/20/2021, 3:43 PM
gotcha, let's try disabling vulnerability processing in that instance, restart it, and see if that calms the db down
3:43 PM
if it's locking that's the problem, you would be able to see that with something like innotop
SK

10/20/2021, 3:46 PM
By disabling, do you mean changing the
current_instance_checks:
setting in that one server?
Tomas Touceda

10/20/2021, 3:47 PM
yes
3:47 PM
oh, but I think I see what might be causing the issue now, so silly to have missed this
SK

10/20/2021, 3:48 PM
disabled the setting
3:53 PM
Did you figure out the cause?
Tomas Touceda

10/20/2021, 3:54 PM
to help confirm that this is the case, could you run the following:
SELECT DISTINCT s.id, scv.cve
		FROM host_software hs
		JOIN hosts h ON (hs.host_id=h.id)
		JOIN software s
		JOIN software_cpe scp ON (s.id=scp.software_id)
		JOIN software_cve scv ON (scp.id=scv.cpe_id)
		WHERE hs.host_id=<the id you used before>
and then compare the speed with the following:
SELECT DISTINCT s.id, scv.cve
		FROM host_software hs
		JOIN hosts h ON (hs.host_id=h.id)
		JOIN software s ON  (s.id=hs.software_id)
		JOIN software_cpe scp ON (s.id=scp.software_id)
		JOIN software_cve scv ON (scp.id=scv.cpe_id)
		WHERE hs.host_id=1
3:54 PM
well, there was a missing condition in the join, so it's joining with the whole table rather than filtering it with the index, so I'm guessing that's the issue
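The effect of the missing join condition can be reproduced on a toy dataset. The sketch below uses an in-memory SQLite database rather than MySQL, with simplified versions of the tables named in the queries above; the column sets, the sample rows, and the placeholder CVE id are all invented for illustration. Host 1 only has software without any CVE, yet the query without the ON clause still reports a CVE for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hosts (id INTEGER PRIMARY KEY);
CREATE TABLE software (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE host_software (host_id INTEGER, software_id INTEGER);
CREATE TABLE software_cpe (id INTEGER PRIMARY KEY, software_id INTEGER);
CREATE TABLE software_cve (cpe_id INTEGER, cve TEXT);

-- Toy rows, invented for illustration only.
INSERT INTO hosts VALUES (1), (2);
INSERT INTO software VALUES (10, 'safe-tool'), (20, 'vulnerable-tool');
INSERT INTO host_software VALUES (1, 10), (2, 20);
INSERT INTO software_cpe VALUES (100, 20);
INSERT INTO software_cve VALUES (100, 'CVE-0000-0001');
""")

QUERY = """
SELECT DISTINCT s.id, scv.cve
FROM host_software hs
JOIN hosts h ON (hs.host_id = h.id)
JOIN software s {software_join}
JOIN software_cpe scp ON (s.id = scp.software_id)
JOIN software_cve scv ON (scp.id = scv.cpe_id)
WHERE hs.host_id = ?
"""


def cves_for_host(host_id, software_join):
    """Run the query with the given join clause for the software table."""
    sql = QUERY.format(software_join=software_join)
    return conn.execute(sql, (host_id,)).fetchall()


# No ON clause: every software row is paired with the host's rows (cross join).
missing_condition = cves_for_host(1, "")
# With the ON clause: only software actually installed on the host is considered.
with_condition = cves_for_host(1, "ON (s.id = hs.software_id)")

print("missing condition:", missing_condition)
print("with condition:   ", with_condition)
```

On real data the unfiltered join has to walk the whole software table for every matching row, which is consistent with one query hanging while the corrected one returns immediately.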
SK

10/20/2021, 4:02 PM
The lower one is giving me an
Empty set
based on the id, and the other one is hanging; I'm still waiting for it to return.
Tomas Touceda

10/20/2021, 4:02 PM
oh, right, please change that 1 for the same id you used in the other
4:06 PM
you can kill the other query, btw
SK

10/20/2021, 4:07 PM
Yes, already did that, but it is still empty. The first query returned: 2727 rows in set (1 min 59.21 sec)
4:07 PM
But the other one is still empty with the correct id
Tomas Touceda

10/20/2021, 4:08 PM
how long did it take?
4:09 PM
we'll be cutting a 4.4.3 version today with this fix, will keep you posted. Thank you for bearing with me through this debugging
SK

10/20/2021, 4:11 PM
But the second query did not give me any results and finished immediately, so I don't think that is correct either, right?
Tomas Touceda

10/20/2021, 4:13 PM
do you see any vulnerabilities reported in the host details page for that host?
SK

10/20/2021, 4:22 PM
Indeed no vulnerability data 😮
Tomas Touceda

10/20/2021, 4:22 PM
yup, so this is the issue for sure
SK

10/20/2021, 4:23 PM
So the other query was just returning all the vulnerabilities and not selecting based on id?
Tomas Touceda

10/20/2021, 4:23 PM
correct
4:23 PM
we'll have a new version soon
SK

10/20/2021, 4:24 PM
Understandable then that the DB is dying on me with 10k+ hosts 😮
4:31 PM
Is that query only running on the server with vulnerability scanning enabled, or on all the servers?
Tomas Touceda

10/20/2021, 4:32 PM
running on all of them
SK

10/20/2021, 4:34 PM
That is why none of my improvements had any effect; I understand it now
4:41 PM
Gonna roll back my changes while waiting on the new release. Thanks for your help fixing this
4:56 PM
Ping me when the new release is out, so that I can test it immediately
Tomas Touceda

10/20/2021, 4:59 PM
will do!
SK

10/21/2021, 7:35 AM
Hey @Tomas Touceda I can confirm that the identified query is the issue. I disabled the vulnerability check for now while waiting on the patch release, and all systems are working fine again: no errors and no load issues anymore.
2:12 PM
Hey @Tomas Touceda Do you have an ETA on the release?
Tomas Touceda

10/21/2021, 2:13 PM
ETA is today, so unless we see anything surprising, it'll land in a few hours
7:02 PM
cc @Gavin
7:02 PM
4.4.3 is available
Gavin

10/22/2021, 7:12 PM
:hurrah:
Flngen Flugen

12/06/2021, 12:12 PM
I'm still seeing this issue, or something similar, on Kubernetes: component=http path=/api/v1/osquery/distributed/write err="authentication error: find host .... context canceled". Running Fleet-Webserver docker.io/fleetdm/fleet:v4.6.1, Fleet-database-mysql docker.io/mysql:8.0.27 and fleet-cache-redis docker.io/bitnami/redis:6.2.6-debian-10-r53. There are no k8s/pod limits imposed, so resources are not an issue. Cluster is pretty beefy.
ytonui

02/10/2022, 1:19 PM
@Flngen Flugen did you resolve this issue?