# fleet
w
What is the connection timeout setting for Fleet, and where are the settings?
k
Hi, @wennan.he ! How much memory do you have allocated for Fleet?
w
I never limited it.
It used over 800G at most.
k
How is Fleet deployed? It might help to get a brief rundown of your infrastructure.
w
We deployed it ourselves, but the problem has gone away.
k
Sounds like there may have been a hiccup somewhere that worked itself out. I'll keep an eye out to see if it happens for anyone else or if it pops back up for you!
w
I am still seeing this problem. Right now we have 20k hosts, Fleet is using 3-4 GB of memory, and it responds pretty slowly. I feel like some thread is taking too long accessing the DB. Is there any way I can figure out which one? @Kathy Satterlee
I also have a new discovery: in Fleet's database, the `host_software` table has more than 10 million records. I suspect this table is causing the problem, and I have a couple of questions. 1. What is this table, and what is it used for? 2. I found that we can disable host software in Fleet. Is that related to this table, and how do I disable it from the fleet.service file?
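On the question of which thread is slow on the database side, one option (a hedged sketch, not a Fleet feature) is MySQL's `SHOW FULL PROCESSLIST`, which lists every connection with its `Command`, `Time` (seconds in the current state), and the running statement in `Info`. The helper below assumes a DB-API cursor from a driver such as PyMySQL (an assumption; any DB-API driver should behave the same) connected as a user with the `PROCESS` privilege, since without it MySQL only shows your own threads. The function names are hypothetical.

```python
from typing import Any, Iterable, List, Tuple

# Column order of MySQL's SHOW FULL PROCESSLIST:
# Id, User, Host, db, Command, Time, State, Info
PROCESSLIST_SQL = "SHOW FULL PROCESSLIST"


def long_running_queries(
    rows: Iterable[Tuple[Any, ...]], min_seconds: int = 5
) -> List[Tuple[int, int, str]]:
    """Return (id, seconds, sql) for busy connections running >= min_seconds."""
    slow = []
    for row in rows:
        conn_id, _user, _host, _db, command, seconds, _state, info = row[:8]
        # Idle connections report Command == "Sleep" and have no statement in Info.
        if command == "Sleep" or info is None:
            continue
        if seconds is not None and seconds >= min_seconds:
            slow.append((conn_id, seconds, info))
    # Longest-running statements first.
    return sorted(slow, key=lambda r: r[1], reverse=True)


def fetch_long_running(cursor, min_seconds: int = 5) -> List[Tuple[int, int, str]]:
    """Run the processlist query on a DB-API cursor and keep only slow statements."""
    cursor.execute(PROCESSLIST_SQL)
    return long_running_queries(cursor.fetchall(), min_seconds)
```

Running `fetch_long_running(conn.cursor())` a few times while Fleet is slow shows which statements keep appearing at the top; those are the ones worth running `EXPLAIN` on.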
k
What version of Fleet are you running? What path do you have set for `vulnerabilities.databases_path`? Does that folder have anything in it? Can you give a rundown of your Fleet architecture? It sounds like things may be struggling to keep up with the volume of traffic. The `host_software` table tracks what software is installed on which hosts. With 20k hosts, I can definitely see that table getting quite large. You can disable software inventory and vulnerability scanning by setting `features.enable_software_inventory` to `false`: https://fleetdm.com/docs/using-fleet/vulnerability-processing#configuration
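The question "Does that folder have anything in it?" can be answered with a quick check on the Fleet server, ideally run as the same user Fleet runs as. This is a minimal sketch with a hypothetical function name. It assumes Fleet needs write access to the directory, since Fleet populates it with downloaded files (later in this thread the configured path fills with files after a restart).

```python
import os
from pathlib import Path


def check_vuln_db_dir(path: str) -> str:
    """Summarize the directory configured as vulnerabilities.databases_path."""
    p = Path(path)
    if not p.exists():
        return "missing"
    if not p.is_dir():
        return "not a directory"
    # Assumption: Fleet writes the downloaded databases here, so it must be writable.
    if not os.access(p, os.W_OK):
        return "not writable"
    files = [f for f in p.iterdir() if f.is_file()]
    if not files:
        return "empty"
    return f"{len(files)} file(s)"
```

For example, `check_vuln_db_dir("/tmp/vulndbs")` returns `"missing"` or `"empty"` when vulnerability processing has nothing to work with.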
w
Could you tell me where to check `vulnerabilities.databases_path`?
k
You can use `fleetctl get config --include-server-config` to pull your server config and check that value.
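If you save that output and parse it into a dict (for example with PyYAML's `yaml.safe_load`, a third-party package), a small search like the one below pulls out the value. It deliberately does not assume where `vulnerabilities` sits in the document, because the exact nesting of the server config section in that output is an assumption here. The function name is hypothetical.

```python
from collections.abc import Mapping
from typing import Any, Optional, Sequence


def find_key_path(doc: Any, path: Sequence[str]) -> Optional[Any]:
    """Depth-first search for a nested key path (e.g. vulnerabilities -> databases_path)
    anywhere in a parsed config document. Returns the first match, or None."""
    if isinstance(doc, Mapping):
        node: Any = doc
        for key in path:
            if isinstance(node, Mapping) and key in node:
                node = node[key]
            else:
                break
        else:
            # Every key in the path matched starting at this level.
            return node
        # Not found here; search nested values.
        for value in doc.values():
            found = find_key_path(value, path)
            if found is not None:
                return found
    elif isinstance(doc, list):
        for item in doc:
            found = find_key_path(item, path)
            if found is not None:
                return found
    return None
```

A `None` result means the key was absent (or explicitly empty), which is the situation discussed next in the thread.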
w
Is there any other way to check it?
k
Do you use environment variables, a config file, or just command line flags to set up Fleet?
w
I have a fleet.service file, but it doesn't contain it.
This is the cfg.
And could you tell me whether the vulnerability processing or software inventory features could cause a huge number of requests to Fleet?
k
Yes, it definitely can, especially when first enabled. Generally speaking, that activity dies down quite a bit once the initial data has been gathered. If that isn't set in the `fleet.conf` file, it may be the culprit. If it isn't set there, you'll need to either define it as a command line flag, `--vulnerabilities_databases_path="/some/path"` (`/tmp/vulndbs` is common), or add it to the configuration file as an environment variable. You can skip setting that if you disable software inventory, but I'd first try making sure that is set up, restarting, and seeing what happens!
w
It is not in /etc/fleet/fleet.conf.
And what is the environment variable for `vulnerability_settings`?
What name for `vulnerability_settings` should I put in that cfg?
k
FLEET_VULNERABILITIES_DATABASES_PATH
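The variable name follows the config key: `vulnerabilities.databases_path` becomes `FLEET_VULNERABILITIES_DATABASES_PATH`. The sketch below encodes that pattern and builds an `Environment=` line for the `[Service]` section of a systemd unit such as `fleet.service`. It assumes the same pattern holds for other Fleet config keys (check the Fleet configuration docs for any key you use), and the function names are hypothetical.

```python
def fleet_env_var(config_key: str) -> str:
    """Map a dotted Fleet config key to its environment variable name, following
    vulnerabilities.databases_path -> FLEET_VULNERABILITIES_DATABASES_PATH."""
    return "FLEET_" + config_key.replace(".", "_").upper()


def systemd_environment_line(config_key: str, value: str) -> str:
    """Build an Environment= directive for the [Service] section of a systemd unit."""
    return f'Environment="{fleet_env_var(config_key)}={value}"'


print(systemd_environment_line("vulnerabilities.databases_path", "/var/fleet/"))
```

After editing the unit file, run `systemctl daemon-reload` before restarting Fleet so systemd picks up the new environment.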
w
I just created this path. Do I need to create any files under it?
I tried it and restarted Fleet, and it looks like it got worse. Fleet's memory usage keeps going up.
k
There's a lot going on there right now, I'd expect that usage to be a bit high. Vulnerability processing does require 4GB of memory.
w
And you're saying it will die down after a while?
k
Yes. There's a lot to process at first, but once the initial data gathering and scans have happened, it'll settle down quite a bit.
w
So could you explain why Fleet had this problem of staying at high CPU and memory consumption (about 3-4 GB) before I set this up, and why there were so many errors (shown above) in the log?
k
Things were getting bogged down because it was trying to process the vulnerabilities unsuccessfully since the database wasn't there. We've noticed that this can cause issues, so we're making some changes to give better messaging (and prevent Fleet from starting) when things aren't set up properly. https://github.com/fleetdm/fleet/issues/7810 Just to be clear though, you may see spikes in memory usage from time to time. Your baseline just shouldn't be this high.
w
Hold on, that DB is there in my case. I can see a lot of records in it:

| table_name | table_rows |
|---|---:|
| host_software | 15338366 |
| cve_meta | 191967 |
| label_membership | 42152 |
| host_users | 41266 |
| host_seen_times | 20983 |
| hosts | 19884 |
| host_device_auth | 19667 |
| host_operating_system | 18793 |
| software_host_counts | 4418 |
| software | 3927 |
| migration_status_tables | 147 |
| sessions | 31 |
| software_cpe | 18 |
| software_cve | 15 |
| activities | 14 |
| aggregated_stats | 11 |
| migration_status_data | 9 |
| operating_systems | 9 |
| labels | 7 |
| queries | 6 |
| distributed_query_campaigns | 6 |
| distributed_query_campaign_targets | 6 |
| locks | 6 |
| enroll_secrets | 3 |
| windows_updates | 0 |
| carve_blocks | 0 |
| host_mdm | 0 |
| network_interfaces | 0 |
| users | 0 |
| host_emails | 0 |
| jobs | 0 |
| app_config_json | 0 |
| munki_issues | 0 |
| user_teams | 0 |
| scheduled_queries | 0 |
| invites | 0 |
| mobile_device_management_solutions | 0 |
| teams | 0 |
| host_batteries | 0 |
| invite_teams | 0 |
| statistics | 0 |
| host_additional | 0 |
| policy_membership | 0 |
| policies | 0 |
| email_changes | 0 |
| password_reset_requests | 0 |
| packs | 0 |
| pack_targets | 0 |
| osquery_options | 0 |
| host_munki_issues | 0 |
| scheduled_query_stats | 0 |
| carve_metadata | 0 |
| host_munki_info | 0 |

Those are my tables.
Is `cve_meta` (191967 rows) what you meant?
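For scale, dividing the `host_software` row count by the `hosts` row count from the table above gives the average number of software entries per host, which fits the earlier point that this table grows with the number of hosts. For InnoDB tables, `table_rows` in `information_schema` is an estimate, so treat the result as approximate. The function name is hypothetical.

```python
def rows_per_host(table_rows: dict, table: str = "host_software", hosts_table: str = "hosts") -> float:
    """Average rows in `table` per host, using (approximate) information_schema row counts."""
    hosts = table_rows.get(hosts_table, 0)
    if hosts <= 0:
        raise ValueError(f"no rows counted for {hosts_table!r}")
    return table_rows[table] / hosts


# Row counts copied from the table above.
counts = {"host_software": 15338366, "hosts": 19884}
print(round(rows_per_host(counts), 1))  # 771.4
```

That is roughly 770 installed-software rows per host.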
k
I'm talking about the vulnerabilities database in the directory that you just created and set in Fleet.
w
And the link says the default path is /tmp/vulndbs, and I have that too.
k
Right, you have it now and things should start to settle once the processing is able to complete.
w
Yes, FLEET_VULNERABILITIES_DATABASES_PATH=/var/fleet/. I can see there are a lot of files under it.
I can also see a lot of similar files under /tmp/vulndbs.
If this is the root cause, how long will it take for my Fleet to return to normal?
k
Exactly. Now that those are there, Fleet will be able to process vulnerabilities successfully, and things should start running smoothly.
I can't give you an exact number there; there are a lot of variables that contribute to the overall time it takes. You've got a lot of hosts with a lot of software, so it could take a while.
w
But my Fleet is still running with very high CPU consumption.
k
Yes, because it's still processing.
w
And there are still a lot of errors in my log.
k
What new errors are you seeing since restarting the server?
w
Sep 21 18:05:04 n107-019-021 fleet[3090473]: {"component":"http","err":"retrieve label queries: selecting label queries for host: context canceled","ip_addr":"10.121.40.209","level":"error","method":"POST
Sep 21 18:05:04 n107-019-021 fleet[3090473]: {"component":"http","err":"retrieve label queries: selecting label queries for host: context canceled","ip_addr":"10.121.8.215","level":"error","method":"POST"
Sep 21 18:05:04 n107-019-021 fleet[3090473]: {"component":"http","err":"authentication error: find host: context canceled","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2022-09-21T18:05:0
Sep 21 18:05:04 n107-019-021 fleet[3090473]: {"component":"http","err":"retrieve label queries: selecting label queries for host: context canceled","ip_addr":"10.121.94.143","level":"error","method":"POST
Sep 21 18:05:04 n107-019-021 fleet[3090473]: {"component":"http","err":"authentication error: find host: context canceled","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2022-09-21T18:05:0
Sep 21 18:05:04 n107-019-021 fleet[3090473]: {"component":"http","err":"authentication error: find host: context canceled","level":"info","path":"/api/v1/osquery/config","ts":"2022-09-21T18:05:04.40348689
Sep 21 18:05:04 n107-019-021 fleet[3090473]: {"component":"http","err":"authentication error: find host: context canceled","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2022-09-21T18:05:0
Sep 21 18:05:04 n107-019-021 fleet[3090473]: {"component":"http","err":"retrieve label queries: selecting label queries for host: context canceled","ip_addr":"10.121.17.61","level":"error","method":"POST"
Sep 21 18:05:04 n107-019-021 fleet[3090473]: {"component":"http","ip_addr":"10.121.108.119","level":"debug","method":"POST","took":"14.727078384s","ts":"2022-09-21T18:05:04.405334667Z","uri":"/api/v1/osqu
Sep 21 18:05:04 n107-019-021 fleet[3090473]: {"component":"http","err":"authentication error: find host: context canceled","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2022-09-21T18:05:0
This doesn't look right; the situation is not mitigated. CPU, memory, and the errors in the log have not changed at all.
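Pasted log lines get truncated, so tallying the `err` field across the full log gives a clearer picture of which errors dominate and on which endpoints. In Go, "context canceled" typically means the request's context was canceled before the work finished, for example because the client disconnected. The sketch below assumes journald output such as `journalctl -u fleet --no-pager` (one JSON object per line after the syslog prefix, as in the paste above); the function name is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable, Tuple


def summarize_errors(lines: Iterable[str]) -> Tuple[Counter, int]:
    """Count Fleet log entries that carry an "err" field, keyed by (path or uri, err).
    Also return how many lines looked like JSON but could not be parsed."""
    counts: Counter = Counter()
    unparsed = 0
    for line in lines:
        start = line.find("{")
        if start < 0:
            continue  # no JSON payload on this line
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            unparsed += 1  # e.g. a line truncated by the terminal or by copy/paste
            continue
        err = entry.get("err")
        if err:
            where = entry.get("path") or entry.get("uri") or "?"
            counts[(where, err)] += 1
    return counts, unparsed
```

Printing `counts.most_common(10)` then shows whether `find host` or `selecting label queries` failures dominate, both of which point back at database calls not finishing in time.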
k
Let's check in on this again in a couple of hours. That will give time for the software processing to finish and hosts to check in a couple of times.
w
OK, sure.
k
Just for some context there, it does look like there's a bit of a bottleneck with MySQL that needs to be addressed, but it would be good to see whether that levels out once things have had a bit of time to settle, or whether it is ongoing.
w
I really doubt that, because our Fleet has been running with 20k hosts for a while, and it never had an issue before.
But something went wrong recently.
Do you think a single-host MySQL cannot handle 20k hosts?
k
It should be fine in theory, just might need to tweak a few things 🙂
Can you take a look at your recent logs now and we'll see what things are looking like?
w
Nothing is going well.
I don't think this is caused by not setting up the vulnerabilities database; it has been a couple of hours since it was set up.
Fleet is still using a lot of CPU.
How can I check why Fleet is doing so much computation?
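In the pasted log, debug-level HTTP entries carry a `took` field (for example `14.727078384s`) next to the `uri`. Ranking requests by that field is one rough way to see which endpoints Fleet spends its time on. This is a sketch under the assumption that those debug entries are present in the log; `took` values are in Go's `time.Duration` string format (`850ms`, `1m2.5s`, and so on). The function names are hypothetical.

```python
import json
import re
from typing import Iterable, List, Tuple

# One number-plus-unit piece of a Go duration string such as "1m2.5s" or "850ms".
# "ms" is listed before "m" and "s" so "850ms" is not misread as minutes.
_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")
_UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}


def go_duration_seconds(text: str) -> float:
    """Convert a Go duration string to seconds by summing its unit pieces."""
    return sum(float(n) * _UNIT_SECONDS[u] for n, u in _DURATION_PART.findall(text))


def slowest_requests(lines: Iterable[str], top: int = 10) -> List[Tuple[float, str]]:
    """Return the `top` slowest (seconds, uri) pairs from Fleet JSON log lines."""
    results = []
    for line in lines:
        start = line.find("{")
        if start < 0:
            continue
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # skip truncated lines
        took = entry.get("took")
        if took:
            results.append((go_duration_seconds(took), entry.get("uri", "?")))
    results.sort(reverse=True)
    return results[:top]
```

If the slowest URIs are the osquery check-in endpoints, that matches the `context canceled` errors above and points at the database rather than at Fleet's own computation.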
k
Thanks for giving it a bit to recheck. Sometimes when you find one problem, it's on to the next one. Let's keep digging in the errors and then see what the cpu usage looks like when things are running properly.
w
Yes, but how?
k
I'm noticing that all of the requests are timing out. Can you check the osquery logs on one of your hosts that is failing (based on the IP in the error) to see if there's any additional context there? If you're using Orbit (Fleet’s osquery package), here's where you can find those: https://github.com/fleetdm/fleet/tree/main/orbit#logs
w
OK, let me check.
I only have permission to log in to one host, and I didn't find any valuable info there.
Is there anything else I can check?