#fleet
wennan.he

09/17/2022, 9:20 PM
Hi Fleet team, we have recently been seeing bursts of requests and errors on Fleet, with logs like these:
May 06 10:25:01 fleet-01.test.tech fleet[3448836]: {"component":"http","err":"timestamp: 2022-05-06T10:24:58Z: error in query ingestion || timestamp: 2022-05-06T10:25:01Z: error in query ingestion || timestamp: 2022-05-06T10:25:01Z: error in query ingestion || timestamp: 2022-05-06T10:25:01Z: error in query ingestion || timestamp: 2022-05-06T10:25:01Z: error in query ingestion || timestamp: 2022-05-06T10:25:01Z: error in query ingestion || getting app config: selecting app config: timestamp: 2022-05-06T10:25:01Z: context canceled","ingestion-err":"ingest detail query: selecting app config: timestamp: 2022-05-06T10:25:01Z: context canceled","ip_addr":"172.10.11.11","level":"error","method":"POST", "took":"19.280912956s","ts":"2022-05-06T10:25:01.630667525Z","uri":"/api/v1/osquery/distributed/write","x_for_ip_addr":"172.10.11.11"}
May 06 10:25:03 fleet-01.test.tech fleet[3448836]: {"component":"http","err":"timestamp: 2022-05-06T10:24:58Z: error in query ingestion || create transaction: timestamp: 2022-05-06T10:25:03Z: context canceled || save host with id 403: timestamp: 2022-05-06T10:25:03Z: context canceled","ingestion-err":"ingesting query software_linux: update host software: insert software: timestamp: 2022-05-06T10:24:58Z: context canceled","ip_addr":"172.10.11.12","level":"error","method":"POST", "took":"20.692362396s","ts":"2022-05-06T10:25:03.53958792Z","uri":"/api/v1/osquery/distributed/write","x_for_ip_addr":"172.10.11.12"}
I suspect this is caused by a MySQL timeout. Is there any recommended strategy for optimizing MySQL for Fleet?
10:29 PM
What are the connection timeout settings for Fleet, and where are they configured?
Kathy Satterlee

09/19/2022, 4:21 PM
Hi, @wennan.he ! How much memory do you have allocated for Fleet?
wennan.he

09/19/2022, 5:12 PM
i never limited it.
5:12 PM
it used over 800G at its peak.
Kathy Satterlee

09/19/2022, 9:44 PM
How is Fleet deployed? It might help to get a brief rundown of your infrastructure.
wennan.he

09/19/2022, 10:09 PM
we deployed it by ourselves, but this situation is gone.
Kathy Satterlee

09/19/2022, 10:21 PM
Sounds like there may have been a hiccup somewhere that worked itself out, I'll keep an eye out to see if it happens for anyone else or if it pops back up for you!
wennan.he

09/21/2022, 4:10 AM
I'm still seeing this. Right now we have 20k hosts and Fleet is using 3-4 GB of memory, and Fleet responds quite slowly. I suspect some thread is taking too long accessing the DB. Is there any way I can figure out which one? @Kathy Satterlee
4:47 AM
I also found something new: in our Fleet DB, the host_software table has more than 10 million records. I really suspect this table is causing the problem, and I have a couple of questions. 1. What is this table, and what is it used for? 2. I found that we can disable host software in Fleet. Is that related to this table? And how do I disable it from the fleet.service file?
Kathy Satterlee

09/21/2022, 3:33 PM
What version of Fleet are you running? What path do you have set for vulnerabilities.databases_path? Does that folder have anything in it? Can you give a rundown of your Fleet architecture? It sounds like things may be struggling to keep up with the volume of traffic. The ‘host_software’ table tracks what software is installed on which hosts. With 20k hosts, I can definitely see that table getting quite large. You can disable software inventory and vulnerability scanning by setting features.enable_software_inventory to false: https://fleetdm.com/docs/using-fleet/vulnerability-processing#configuration
wennan.he

09/21/2022, 4:37 PM
could u tell me where to check vulnerabilities.databases_path?
Kathy Satterlee

09/21/2022, 4:43 PM
You can use
fleetctl get config --include-server-config
to pull your server config and check that value.
wennan.he

09/21/2022, 5:06 PM
is there any other way to check it?
Kathy Satterlee

09/21/2022, 5:14 PM
Do you use environment variables, a config file, or just command line flags to set up Fleet?
wennan.he

09/21/2022, 5:20 PM
i have fleet.service file but it doesn't contain it.
5:20 PM
[Unit]
Description=Fleet
After=network.target

[Service]
User=root
Group=root
LimitNOFILE=20000
EnvironmentFile=-/etc/fleet/fleet.conf
ExecStart=/usr/bin/fleet serve \
  --mysql_address=127.0.0.1:3306 \
  --mysql_database=fleet \
  --redis_address=127.0.0.1:6379 \
  --redis_password=fleetpass \
  --filesystem_enable_log_compression=true \
  --filesystem_enable_log_rotation=true \
  --filesystem_result_log_file=/var/log/fleet/result.log \
  --server_tls=false \
  --logging_json=true \
  --logging_debug=true

[Install]
WantedBy=multi-user.target
5:20 PM
this is the cfg
5:22 PM
and could you tell me whether the vulnerability processing or software inventory feature could cause a huge number of requests to Fleet?
Kathy Satterlee

09/21/2022, 5:28 PM
Yes, it definitely can, especially when first enabled. Generally speaking, that activity dies down quite a bit once the initial data has been gathered. If vulnerabilities.databases_path isn't set in the fleet.conf file, that may be the culprit. If it isn't set, you'll need to either define it as a command line flag, --vulnerabilities_databases_path="/some/path" (/tmp/vulndbs is common), or add it to the configuration file as an environment variable. You can skip setting that if you disable software inventory, but I'd try making sure that is set up, restarting, and seeing what happens first!
wennan.he

09/21/2022, 5:29 PM
it is not in /etc/fleet/fleet.conf
5:30 PM
and what is the env var for vulnerability_settings?
5:31 PM
what is the name of the vulnerability_settings variable I should put in that config?
Kathy Satterlee

09/21/2022, 5:31 PM
FLEET_VULNERABILITIES_DATABASES_PATH
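For reference, here is a minimal sketch of the check being discussed. It derives the environment variable name from the config key (assuming Fleet's convention of a FLEET_ prefix plus the upper-cased key with dots replaced by underscores), looks that name up in EnvironmentFile-style text such as the contents of /etc/fleet/fleet.conf, and reports whether the directory exists and contains files. The helper names are hypothetical, and the parser handles only simple KEY=VALUE lines.

```python
import os


def env_var_name(config_key: str) -> str:
    """Map a config key like 'vulnerabilities.databases_path' to an env var name.

    Assumes the FLEET_ prefix + upper-case + underscores convention.
    """
    return "FLEET_" + config_key.replace(".", "_").upper()


def parse_env_file(text: str) -> dict:
    """Parse simple KEY=VALUE lines; blank lines and '#' comments are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def check_vuln_path(env_text: str) -> str:
    """Return 'unset', 'missing', 'empty', or 'populated' for the vulnerabilities path."""
    name = env_var_name("vulnerabilities.databases_path")
    path = parse_env_file(env_text).get(name)
    if not path:
        return "unset"
    if not os.path.isdir(path):
        return "missing"
    return "populated" if os.listdir(path) else "empty"
```

To use it, read the environment file into a string and pass it to check_vuln_path. A result of "unset" or "empty" corresponds to the situation described above, where vulnerability processing has no local database to work with.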
wennan.he

09/21/2022, 5:38 PM
I just created this path. Do I need to create any files under it?
5:47 PM
I tried it and restarted Fleet, and it looks like it got worse. Fleet's memory usage is going higher.
Kathy Satterlee

09/21/2022, 5:48 PM
There's a lot going on there right now, I'd expect that usage to be a bit high. Vulnerability processing does require 4GB of memory.
wennan.he

09/21/2022, 5:48 PM
and you say it will die down after a while?
Kathy Satterlee

09/21/2022, 5:49 PM
Yes. There's a lot to process at first, but once the initial data gathering and scans have happened, it'll settle down quite a bit.
wennan.he

09/21/2022, 5:50 PM
So could you explain why Fleet was stuck at high CPU and memory usage (about 3-4 GB) before I set this up, and why there were so many errors (shown above) in the log?
Kathy Satterlee

09/21/2022, 5:56 PM
Things were getting bogged down because it was trying to process the vulnerabilities unsuccessfully since the database wasn't there. We've noticed that this can cause issues, so we're making some changes to give better messaging (and prevent Fleet from starting) when things aren't set up properly. https://github.com/fleetdm/fleet/issues/7810 Just to be clear though, you may see spikes in memory usage from time to time. Your baseline just shouldn't be this high.
wennan.he

09/21/2022, 5:57 PM
Hold on, that DB is there in my case. I can see there are a lot of records in my DB.
+------------------------------------+------------+
| table_name                         | table_rows |
+------------------------------------+------------+
| host_software                      |   15338366 |
| cve_meta                           |     191967 |
| label_membership                   |      42152 |
| host_users                         |      41266 |
| host_seen_times                    |      20983 |
| hosts                              |      19884 |
| host_device_auth                   |      19667 |
| host_operating_system              |      18793 |
| software_host_counts               |       4418 |
| software                           |       3927 |
| migration_status_tables            |        147 |
| sessions                           |         31 |
| software_cpe                       |         18 |
| software_cve                       |         15 |
| activities                         |         14 |
| aggregated_stats                   |         11 |
| migration_status_data              |          9 |
| operating_systems                  |          9 |
| labels                             |          7 |
| queries                            |          6 |
| distributed_query_campaigns        |          6 |
| distributed_query_campaign_targets |          6 |
| locks                              |          6 |
| enroll_secrets                     |          3 |
| windows_updates                    |          0 |
| carve_blocks                       |          0 |
| host_mdm                           |          0 |
| network_interfaces                 |          0 |
| users                              |          0 |
| host_emails                        |          0 |
| jobs                               |          0 |
| app_config_json                    |          0 |
| munki_issues                       |          0 |
| user_teams                         |          0 |
| scheduled_queries                  |          0 |
| invites                            |          0 |
| mobile_device_management_solutions |          0 |
| teams                              |          0 |
| host_batteries                     |          0 |
| invite_teams                       |          0 |
| statistics                         |          0 |
| host_additional                    |          0 |
| policy_membership                  |          0 |
| policies                           |          0 |
| email_changes                      |          0 |
| password_reset_requests            |          0 |
| packs                              |          0 |
| pack_targets                       |          0 |
| osquery_options                    |          0 |
| host_munki_issues                  |          0 |
| scheduled_query_stats              |          0 |
| carve_metadata                     |          0 |
| host_munki_info                    |          0 |
+------------------------------------+------------+
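As a rough sanity check on the numbers above, the sketch below divides the host_software row count by the hosts row count. Both figures are copied from the table. If they come from information_schema (an assumption based on the column names), InnoDB row counts are estimates, so the result is approximate.

```python
# Row counts copied from the table above (possibly estimates, see the note).
host_software_rows = 15_338_366
hosts_rows = 19_884

# Average number of host_software rows per enrolled host.
rows_per_host = host_software_rows / hosts_rows
print(f"about {rows_per_host:.0f} host_software rows per host")
```

That works out to several hundred software entries per host, which fits the earlier remark that this table gets quite large with 20k hosts.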
5:57 PM
that is my tables
5:58 PM
cve_meta | 191967 | Is this what you meant?
Kathy Satterlee

09/21/2022, 5:59 PM
I'm talking about the vulnerabilities database in the directory that you just created and set in Fleet.
wennan.he

09/21/2022, 5:59 PM
and the link says the default path is /tmp/vulndbs, and I also have it
Kathy Satterlee

09/21/2022, 6:00 PM
Right, you have it now and things should start to settle once the processing is able to complete.
wennan.he

09/21/2022, 6:00 PM
Yes, FLEET_VULNERABILITIES_DATABASES_PATH=/var/fleet/ and I can see there are a lot of files under it.
6:01 PM
and I can also see a lot of similar files under /tmp/vulndbs
6:01 PM
If this is the root cause, how long until my Fleet gets back to normal?
Kathy Satterlee

09/21/2022, 6:01 PM
Exactly. Now that those are there, Fleet will be able to process vulnerabilities successfully, and things should start running smoothly.
6:03 PM
I can't give you an exact number there, there are a lot of variables that would contribute to the overall time it takes. You've got a lot of hosts with a lot of software so it could take a while.
wennan.he

09/21/2022, 6:03 PM
but my Fleet is still running with very high CPU usage.
Kathy Satterlee

09/21/2022, 6:03 PM
Yes, because it's still processing.
wennan.he

09/21/2022, 6:03 PM
and there are still a lot of errors in my log
Kathy Satterlee

09/21/2022, 6:05 PM
What new errors are you seeing since restarting the server?
wennan.he

09/21/2022, 6:05 PM
Sep 21 18:05:04 n107-019-021 fleet[3090473]: {"component":"http","err":"retrieve label queries: selecting label queries for host: context canceled","ip_addr":"10.121.40.209","level":"error","method":"POST
Sep 21 18:05:04 n107-019-021 fleet[3090473]: {"component":"http","err":"retrieve label queries: selecting label queries for host: context canceled","ip_addr":"10.121.8.215","level":"error","method":"POST"
Sep 21 18:05:04 n107-019-021 fleet[3090473]: {"component":"http","err":"authentication error: find host: context canceled","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2022-09-21T18:05:0
Sep 21 18:05:04 n107-019-021 fleet[3090473]: {"component":"http","err":"retrieve label queries: selecting label queries for host: context canceled","ip_addr":"10.121.94.143","level":"error","method":"POST
Sep 21 18:05:04 n107-019-021 fleet[3090473]: {"component":"http","err":"authentication error: find host: context canceled","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2022-09-21T18:05:0
Sep 21 18:05:04 n107-019-021 fleet[3090473]: {"component":"http","err":"authentication error: find host: context canceled","level":"info","path":"/api/v1/osquery/config","ts":"2022-09-21T18:05:04.40348689
Sep 21 18:05:04 n107-019-021 fleet[3090473]: {"component":"http","err":"authentication error: find host: context canceled","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2022-09-21T18:05:0
Sep 21 18:05:04 n107-019-021 fleet[3090473]: {"component":"http","err":"retrieve label queries: selecting label queries for host: context canceled","ip_addr":"10.121.17.61","level":"error","method":"POST"
Sep 21 18:05:04 n107-019-021 fleet[3090473]: {"component":"http","ip_addr":"10.121.108.119","level":"debug","method":"POST","took":"14.727078384s","ts":"2022-09-21T18:05:04.405334667Z","uri":"/api/v1/osqu
Sep 21 18:05:04 n107-019-021 fleet[3090473]: {"component":"http","err":"authentication error: find host: context canceled","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2022-09-21T18:05:0
6:07 PM
This doesn't look right. The situation is not mitigated: CPU, memory, and the errors in the log haven't changed.
Kathy Satterlee

09/21/2022, 6:20 PM
Let's check in on this again in a couple of hours. That will give time for the software processing to finish and hosts to check in a couple of times.
wennan.he

09/21/2022, 6:23 PM
ok sure
Kathy Satterlee

09/21/2022, 6:48 PM
Just for some context there, it does look like there's a bit of a bottleneck with MySQL that needs to be addressed, but it would be good to see if that levels out once things have had a bit to settle or is ongoing.
wennan.he

09/21/2022, 6:54 PM
I really doubt that, because our Fleet has been running with 20k hosts for a while and never had issues before.
6:55 PM
but something went wrong recently.
6:55 PM
do you think a single-host MySQL cannot handle 20k hosts?
Kathy Satterlee

09/21/2022, 8:44 PM
It should be fine in theory, just might need to tweak a few things 🙂
8:44 PM
Can you take a look at your recent logs now and we'll see what things are looking like?
wennan.he

09/21/2022, 9:50 PM
nothing is going well.
9:50 PM
Sep 21 21:50:04 n107-019-021 fleet[3090473]: {"component":"http","err":"retrieve policy queries: selecting policies for host: context canceled","ip_addr":"10.121.94.204","level":"error","method":"POST","took":"15.985334748s","ts":"2022-09-21T21:50:04.876513568Z","uri":"/api/v1/osquery/distributed/read","x_for_ip_addr":"10.121.94.204"}
Sep 21 21:50:04 n107-019-021 fleet[3090473]: {"component":"http","err":"retrieve policy queries: selecting policies for host: context canceled","ip_addr":"10.121.111.218","level":"error","method":"POST","took":"15.986812811s","ts":"2022-09-21T21:50:04.876783944Z","uri":"/api/v1/osquery/distributed/read","x_for_ip_addr":"10.121.111.218"}
Sep 21 21:50:04 n107-019-021 fleet[3090473]: {"component":"http","err":"retrieve label queries: selecting label queries for host: context canceled","ip_addr":"10.121.10.84","level":"error","method":"POST","took":"15.975101385s","ts":"2022-09-21T21:50:04.877399287Z","uri":"/api/v1/osquery/distributed/read","x_for_ip_addr":"10.121.10.84"}
Sep 21 21:50:04 n107-019-021 fleet[3090473]: {"component":"http","err":"retrieve label queries: selecting label queries for host: context canceled","ip_addr":"10.121.29.109","level":"error","method":"POST","took":"15.983538752s","ts":"2022-09-21T21:50:04.877772402Z","uri":"/api/v1/osquery/distributed/read","x_for_ip_addr":"10.121.29.109"}
Sep 21 21:50:04 n107-019-021 fleet[3090473]: {"component":"http","err":"authentication error: find host: context canceled","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2022-09-21T21:50:04.877998806Z"}
Sep 21 21:50:04 n107-019-021 fleet[3090473]: {"component":"http","err":"authentication error: find host: context canceled","level":"info","path":"/api/v1/osquery/distributed/write","ts":"2022-09-21T21:50:04.878149391Z"}
Sep 21 21:50:04 n107-019-021 fleet[3090473]: {"component":"http","err":"retrieve label queries: selecting label queries for host: context canceled","ip_addr":"10.121.37.123","level":"error","method":"POST","took":"15.979903898s","ts":"2022-09-21T21:50:04.878665921Z","uri":"/api/v1/osquery/distributed/read","x_for_ip_addr":"10.121.37.123"}
Sep 21 21:50:04 n107-019-021 fleet[3090473]: {"component":"http","err":"authentication error: find host: context canceled","level":"info","path":"/api/v1/osquery/distributed/write","ts":"2022-09-21T21:50:04.878824292Z"}
Sep 21 21:50:04 n107-019-021 fleet[3090473]: {"component":"http","err":"authentication error: find host: context canceled","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2022-09-21T21:50:04.879421601Z"}
Sep 21 21:50:04 n107-019-021 fleet[3090473]: {"component":"http","err":"retrieve label queries: selecting label queries for host: context canceled","ip_addr":"10.121.16.185","level":"error","method":"POST","took":"15.980457397s","ts":"2022-09-21T21:50:04.879521087Z","uri":"/api/v1/osquery/distributed/read","x_for_ip_addr":"10.121.16.185"}
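Log dumps like the one above are easier to compare over time once they are tallied. The following is a minimal sketch, not a Fleet tool: it extracts the JSON payload from each journald-style line, counts error messages, and collects the "took" durations per URI. The function names are hypothetical, only durations written as plain seconds (such as "15.985334748s") are parsed, and truncated lines are skipped.

```python
import json
import re
from collections import Counter


def parse_fleet_line(line: str):
    """Return the JSON payload of a journald-style log line, or None if unparsable."""
    start = line.find("{")
    if start == -1:
        return None
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        return None  # truncated pastes end mid-object and are skipped


def took_seconds(value: str):
    """Parse a seconds-only duration such as '15.985334748s'; other forms give None."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)s", value)
    return float(match.group(1)) if match else None


def summarize(lines):
    """Count error messages and collect request durations (seconds) per URI."""
    errors = Counter()
    durations = {}
    for line in lines:
        entry = parse_fleet_line(line)
        if entry is None:
            continue
        if "err" in entry:
            errors[entry["err"]] += 1
        seconds = took_seconds(entry.get("took", ""))
        if seconds is not None and "uri" in entry:
            durations.setdefault(entry["uri"], []).append(seconds)
    return errors, durations


# Two lines taken from the paste above.
sample = [
    'Sep 21 21:50:04 n107-019-021 fleet[3090473]: {"component":"http","err":"retrieve policy queries: selecting policies for host: context canceled","ip_addr":"10.121.94.204","level":"error","method":"POST","took":"15.985334748s","ts":"2022-09-21T21:50:04.876513568Z","uri":"/api/v1/osquery/distributed/read","x_for_ip_addr":"10.121.94.204"}',
    'Sep 21 21:50:04 n107-019-021 fleet[3090473]: {"component":"http","err":"authentication error: find host: context canceled","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2022-09-21T21:50:04.877998806Z"}',
]
errors, durations = summarize(sample)
for message, count in errors.most_common():
    print(count, message)
for uri, values in durations.items():
    print(uri, f"max {max(values):.1f}s over {len(values)} request(s)")
```

Running the same tally on a longer capture before and after a configuration change gives a concrete basis for saying whether the error mix or the request durations have actually moved.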
9:52 PM
I don't think this is caused by the vulnerabilities database not being set up. It has been a couple of hours since it was set up.
9:58 PM
Fleet is still at high CPU usage
9:59 PM
How can I check why Fleet is doing so much computing?
Kathy Satterlee

09/21/2022, 10:08 PM
Thanks for giving it a bit to recheck. Sometimes when you find one problem, it's on to the next one. Let's keep digging in the errors and then see what the cpu usage looks like when things are running properly.
wennan.he

09/21/2022, 10:10 PM
yes, but how
Kathy Satterlee

09/21/2022, 10:11 PM
I'm noticing that all of the requests are timing out. Can you check the osquery logs on one of your hosts that is failing (based on the IP in the error) to see if there's any additional context there? If you're using Orbit (Fleet’s osquery package), here's where you can find those: https://github.com/fleetdm/fleet/tree/main/orbit#logs
wennan.he

09/21/2022, 10:32 PM
ok let me check out
10:54 PM
I only have permission to log in to one host, and I didn't find any valuable info.
10:54 PM
Is there anything else I can check?