# fleet
s
After a network hiccup we are seeing thousands of these errors:
fleet[7390]: {"component":"http","err":"authentication error: find host","level":"info","path":"/api/v1/osquery/log"
We are running 4.4.0. We had policies running but removed them when the DB became unresponsive, yet we still get these errors. What can we do to solve this?
These are the errors we are getting at this moment the most:
"authentication error: find host"	
"updating hosts label updated at: context canceled"	
"create transaction: context canceled"
"delete label query executions: context canceled"
Is there any way to reset it?
t
it should reset on its own. Depending on your setup, there might be a lot of threads trying to do things in the database as hosts check in, and the hiccup made everything fail. If you want it to stop, you might need to stop fleet serve for a bit, make sure your DB is stable, and then restart it. However, that only keeps the logs clean; you don't really need to do it. As connections resume, if the DB is reachable with the same parameters, it'll start working again
I suggest you update to 4.4.2, as the policies issue was resolved
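(For reference, a quick way to gauge whether the DB has actually settled before restarting fleet serve could look roughly like this; these are standard MySQL status variables and the comments are only rough guidance:)
SHOW GLOBAL STATUS LIKE 'Uptime';            -- a large value confirms the MySQL server itself did not restart during the hiccup
SHOW GLOBAL STATUS LIKE 'Threads_running';   -- should fall back to a small number once the backlog clears
SHOW GLOBAL STATUS LIKE 'Threads_connected'; -- open client connections, including Fleet's pool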
s
Thanks for the feedback @Tomas Touceda. Still trying to get it stable; if that does not work I will upgrade to 4.4.2
Looking at the MySQL logs I see lots of the following errors; only running 1 Fleet server at this moment:
Aborted connection 12345 to db: 'db' user: 'user' host: 'ip' (Got an error reading communication packets)
Aborted connection 12345 to db: 'db' user: 'user' host: 'ip' (Got an error writing communication packets)
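(Those "Aborted connection ... communication packets" messages usually mean clients dropped mid-conversation, e.g. timeouts, network resets, or oversized packets; a few standard counters and variables to check, as a sketch:)
SHOW GLOBAL STATUS LIKE 'Aborted_clients';   -- connections that were dropped without a clean close
SHOW GLOBAL STATUS LIKE 'Aborted_connects';  -- failed connection attempts
SHOW VARIABLES LIKE 'wait_timeout';          -- idle connections are cut after this many seconds
SHOW VARIABLES LIKE 'max_allowed_packet';    -- packets larger than this also abort the connection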
The upgrade finished, but I'm not seeing any improvement yet. Any other tips?
t
could you tell me a bit more about your infrastructure? database size, type of deployment, fleet instance count/size, etc. I feel like I already asked you this question, so apologies for the repeat, but it's hard to keep track.
s
Hey @Tomas Touceda No problem. We have about 10k hosts connected, 3 Fleet instances talking to 1 DB; the DB is 16 cores and 64 GB. Until the hiccup everything was working fine; after that it's constant 100% CPU usage and an unusable DB. I tried different things, including your tip regarding
delete from policies
to see if that helped, but nothing changed. To me it seems there is some issue with the policy data that osquery keeps sending to the DB.
t
100% CPU usage in the db, correct? do you have a list of the top queries?
s
Yes, on the DB, correct. What is the best way to get that list? Btw, on the Fleet side I keep getting bombarded with
authentication error: find host
t
do you have slow log enabled?
s
slow_query_log
is disabled
t
enabling that could be good,
SHOW FULL PROCESSLIST;
would list what's running, but there are other performance tools that might be easier depending on your setup. If you have Prometheus, seeing ops and times could be useful
other than that, if you could share a big chunk of the fleet serve log with debug logging enabled, that might help shed some light as to what's happening
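(For reference, enabling the slow query log at runtime could look roughly like this, assuming you have privileges to set global variables; the file path is only a placeholder and the one-second threshold is just a starting point:)
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL slow_query_log_file = '/var/lib/mysql/slow.log';  -- placeholder path, use whatever suits your setup
SET GLOBAL long_query_time = 1;                              -- log statements slower than 1 second
SHOW VARIABLES LIKE 'slow_query%';                           -- confirm the settings took effect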
s
I made some changes per your tips: enabled the slow query log to see if it logs anything, and I lowered
max_open_conns
and
max_idle_conns
(they were 1000 and 200, now 100 and 20). The DB seems more stable now; I don't know if it will cause issues anywhere else.
Are there any recommended MySQL settings for a large deployment?
slow_query_log
is giving me this over and over
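(One way to sanity-check the pool change from the MySQL side, as a sketch; keep in mind the effective pool is max_open_conns multiplied by the number of Fleet instances:)
SHOW VARIABLES LIKE 'max_connections';          -- server-side cap on connections
SHOW GLOBAL STATUS LIKE 'Threads_connected';    -- currently open connections across all Fleet instances
SHOW GLOBAL STATUS LIKE 'Max_used_connections'; -- high-water mark since the last restart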
t
could you tell me the output of the following query:
SELECT TABLE_ROWS, TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'fleet'
?
s
t
hm, what does the following query give you:
explain SELECT DISTINCT s.id, scv.cve
		FROM host_software hs
		JOIN hosts h ON (hs.host_id=h.id)
		JOIN software s
		JOIN software_cpe scp ON (s.id=scp.software_id)
		JOIN software_cve scv ON (scp.id=scv.cpe_id)
		WHERE hs.host_id=1
?
s
Seems as if something ain't right...
t
probably the host_id, grab a random id from the hosts table, and plug it at the end instead of the 1
s
is that the
osquery_host_id
from hosts or
uuid
?
My bad, I see it is id when I checked the query.
t
an example could be what comes from
select id from hosts limit 1
s
Censored a bit, not sure what I'm looking at 😄
t
could you tell me the fleet serve config you're using? (note that this is not
fleetctl get config
, but the config you set for Fleet to start)
the EXPLAIN basically tells me that all indexes are in place and the query is meant to be fast, so what might be happening is locking
also, what size are the fleet instances?
s
size as in CPU?
t
CPU+RAM+disk space
s
CPU: 4 cores × 3 servers, RAM: 16 GB, HDD: 60 GB
t
you have a specific instance you set for vulnerability processing, correct?
s
yes correct
That one is not counted above; it is a 4th server that is not connected to any agents
t
gotcha, let's try disabling vulnerability processing in that instance, restart it, and see if that calms the db down
if it's locking that's the problem, you would be able to see that with something like innotop
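(If innotop isn't handy, a rough way to spot lock waits from a mysql shell, assuming MySQL 5.7+ with the sys schema installed:)
SHOW ENGINE INNODB STATUS\G   -- the TRANSACTIONS section lists transactions waiting on locks
SELECT waiting_pid, waiting_query, blocking_pid, blocking_query
FROM sys.innodb_lock_waits;   -- one row per blocked statement, with the query that is blocking it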
s
By disabling, you mean changing the
current_instance_checks:
setting in that one server?
t
yes
oh, but I think I see what might be causing the issue now, so silly to have missed this
s
disabled the setting
Did you figure out the cause?
t
to help confirm that this is the case, could you run the following:
SELECT DISTINCT s.id, scv.cve
		FROM host_software hs
		JOIN hosts h ON (hs.host_id=h.id)
		JOIN software s
		JOIN software_cpe scp ON (s.id=scp.software_id)
		JOIN software_cve scv ON (scp.id=scv.cpe_id)
		WHERE hs.host_id=<the id you used before>
and then compare the speed with the following:
SELECT DISTINCT s.id, scv.cve
		FROM host_software hs
		JOIN hosts h ON (hs.host_id=h.id)
		JOIN software s ON  (s.id=hs.software_id)
		JOIN software_cpe scp ON (s.id=scp.software_id)
		JOIN software_cve scv ON (scp.id=scv.cpe_id)
		WHERE hs.host_id=1
well, there was a missing condition in the join (the software table had no ON clause), so it's joining against the whole table rather than filtering it with the index; I'm guessing that's the issue
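(As an aside, not from the thread itself: in MySQL a JOIN written without an ON clause is treated as a CROSS JOIN, so the first query pairs every host_software row with every software row before DISTINCT trims the result, which is why it crawls at this scale. A minimal sketch of the difference:)
-- Cartesian product: every host_software row paired with every software row
SELECT COUNT(*) FROM host_software hs JOIN software s;
-- With the ON clause, each host_software row matches at most one software row
SELECT COUNT(*) FROM host_software hs JOIN software s ON s.id = hs.software_id;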
s
At the moment the lower one is giving me an
Empty set
based on the id, and the other one is hanging; waiting for it to return at the moment
t
oh, right, please change that 1 for the same id you used in the other
you can kill the other query, btw
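(Killing the hanging query could look like this; 12345 is a placeholder for the Id shown in the processlist:)
SHOW FULL PROCESSLIST;   -- find the Id of the long-running SELECT
KILL QUERY 12345;        -- stops that statement but keeps the client connection open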
s
Yes, I already did that, but it's still empty. The first query took: 2727 rows in set (1 min 59.21 sec)
But the other one is still empty with the correct id
t
how long did it take?
we'll be cutting a 4.4.3 version today with this fix, will keep you posted. Thank you for bearing with me through this debugging
s
But the second query did not give me any results and finished immediately, so I don't think that is correct either, right?
t
do you see any vulnerabilities reported in the host details page for that host?
s
Indeed no vulnerability data 😮
t
yup, so this is the issue for sure
s
So the other query was just returning all the vulnerabilities and not selecting based on id?
t
correct
we'll have a new version soon
s
Understandable then that the DB is dying on me with 10k+ hosts 😮
Is that query only running from the server with the vulnerability scanning enabled or running from all the servers?
t
running on all of them
s
That is why all my improvements had no effect; I understand it now
Gonna roll back my changes while waiting on the new release. Thanks for your help fixing this
Ping me when the new release is out so I can test it immediately
t
will do!
s
Hey @Tomas Touceda I can confirm the identified query is the issue. I disabled the vulnerability check for now while waiting on the patch release, and all systems are working fine again: no errors and no load issues anymore.
👍 1
Hey @Tomas Touceda Do you have an ETA on the release?
t
ETA is today, so unless we see anything surprising, it'll land in a few hours
👍 1
cc @Gavin
4.4.3 is available
🎉 1
g
hurrah
f
I'm still seeing this issue (or something similar) on Kubernetes: component=http path=/api/v1/osquery/distributed/write err="authentication error: find host .... context canceled". Fleet webserver: docker.io/fleetdm/fleet:v4.6.1, Fleet MySQL database: docker.io/mysql:8.0.27, and Fleet Redis cache: docker.io/bitnami/redis:6.2.6-debian-10-r53. There are no k8s pod limits imposed, so resources are not an issue; the cluster is pretty beefy.
y
@Flngen Flugen did you resolve this issue?