#fleet
wennan.he

09/22/2022, 4:22 AM
Also, Fleet is consuming a lot of memory:
systemctl status fleet.service
● fleet.service - Fleet
Loaded: loaded (/etc/systemd/system/fleet.service; disabled; vendor preset: enabled)
Active: active (running) since Wed 2022-09-21 17:42:23 UTC; 5h 20min ago
Main PID: 3090473 (fleet)
Tasks: 19 (limit: 4915)
Memory: 3.9G
CPU: 11h 21min 13.035s
CGroup: /system.slice/fleet.service
└─3090473 /usr/bin/fleet serve --mysql_address=127.0.0.1:3306 --mysql_database=fleet --mysql_username=root --mysql_password=admin --redis_address=127.0.0.1:6379 --redis_password=fleetpass --fil

Errors in the Fleet log:
Sep 21 23:03:08 n107-019-021 fleet[3090473]: {"component":"http","err":"authentication error: find host: context canceled","level":"info","path":"/api/v1/osquery/config","ts":"2022-09-21T23:03:08.480797827Z"}
Sep 21 23:03:08 n107-019-021 fleet[3090473]: {"component":"http","err":"authentication error: find host: context canceled","level":"info","path":"/api/v1/osquery/config","ts":"2022-09-21T23:03:08.481011716Z"}
Sep 21 23:03:08 n107-019-021 fleet[3090473]: {"component":"http","err":"error in query ingestion || error in query ingestion || error in query ingestion || error in query ingestion || error in query ingestion || error in query ingestion || error in query ingestion || getting app config: selecting app config: context canceled","ingestion-err":"ingest detail query: selecting app config: context canceled","ip_addr":"10.121.73.56","level":"error","method":"POST","took":"28.666015098s","ts":"2022-09-21T23:03:08.481223072Z","uri":"/api/v1/osquery/distributed/write","x_for_ip_addr":"10.121.73.56"}
Sep 21 23:03:08 n107-019-021 fleet[3090473]: 2022/09/21 23:03:08 http: Accept error: accept tcp [::]:8080: accept4: too many open files; retrying in 5ms
Sep 21 23:03:08 n107-019-021 fleet[3090473]: {"component":"http","err":"error in query ingestion || error in query ingestion || error in query ingestion || error in query ingestion || error in query ingestion || error in query ingestion || error in query ingestion || getting app config: selecting app config: context canceled","ingestion-err":"ingest detail query: selecting app config: context canceled","ip_addr":"10.121.42.98","level":"error","method":"POST","took":"29.773316374s","ts":"2022-09-21T23:03:08.482628392Z","uri":"/api/v1/osquery/distributed/write","x_for_ip_addr":"10.121.42.98"}
Sep 21 23:03:08 n107-019-021 fleet[3090473]: {"component":"http","err":"authentication error: find host: context canceled","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2022-09-21T23:03:08.485681384Z"}
Sep 21 23:03:08 n107-019-021 fleet[3090473]: 2022/09/21 23:03:08 http: Accept error: accept tcp [::]:8080: accept4: too many open files; retrying in 5ms
Sep 21 23:03:08 n107-019-021 fleet[3090473]: {"component":"http","err":"error in query ingestion || error in query ingestion || error in query ingestion || error in query ingestion || error in query ingestion || error in query ingestion || error in query ingestion || error in query ingestion || getting app config: selecting app config: context canceled","ingestion-err":"ingest detail query: selecting app config: context canceled","ip_addr":"10.121.31.61","level":"error","method":"POST","took":"24.81762093s","ts":"2022-09-21T23:03:08.489245127Z","uri":"/api/v1/osquery/distributed/write","x_for_ip_addr":"10.121.31.61"}
Sep 21 23:03:08 n107-019-021 fleet[3090473]: 2022/09/21 23:03:08 http: Accept error: accept tcp [::]:8080: accept4: too many open files; retrying in 5ms
Sep 21 23:03:08 n107-019-021 fleet[3090473]: {"component":"http","err":"error in query ingestion || error in query ingestion || error in query ingestion || error in query ingestion || error in query ingestion || error in query ingestion || error in query ingestion || error in query ingestion || getting app config: selecting app config: context canceled","ingestion-err":"ingest detail query: selecting app config: context canceled","ip_addr":"10.121.29.42","level":"error","method":"POST","took":"28.550266401s","ts":"2022-09-21T23:03:08.496053427Z","uri":"/api/v1/osquery/distributed/write","x_for_ip_addr":"10.121.29.42"}
Sep 21 23:03:08 n107-019-021 fleet[3090473]: {"component":"http","err":"retrieve policy queries: selecting policies for host: context canceled","ip_addr":"10.121.35.121","level":"error","method":"POST","took":"15.98712566s","ts":"2022-09-21T23:03:08.498432891Z","uri":"/api/v1/osquery/distributed/read","x_for_ip_addr":"10.121.35.121"}

But we only have 20k hosts. Please advise.
4:25 AM
All of this shows that the agents send a lot of read and write requests to Fleet, which crashes our service. But we still don't know why.
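The "accept4: too many open files" lines in the log above mean the Fleet process ran into its open-file-descriptor limit, so one thing worth checking is how high that limit is. The following is a minimal sketch, assuming a Linux host where /proc/<pid>/limits is readable; the helper names and the sample limit values are hypothetical and do not come from this host.

```python
import os


def parse_max_open_files(limits_text):
    """Return (soft, hard) for the 'Max open files' row of /proc/<pid>/limits.

    A limit reported as 'unlimited' is returned as None.
    Returns None if the row is not present at all.
    """
    def to_int(value):
        return None if value == "unlimited" else int(value)

    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()  # ['Max', 'open', 'files', soft, hard, 'files']
            return to_int(fields[3]), to_int(fields[4])
    return None


def count_open_fds(pid):
    """Count descriptors currently open in a process (Linux only, needs permission)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


# Illustrative sample text, NOT taken from the host in this thread.
SAMPLE_LIMITS = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max open files            1024                 4096                 files\n"
)

soft, hard = parse_max_open_files(SAMPLE_LIMITS)
print(f"soft={soft} hard={hard}")
```

On the real host, the text would come from reading /proc/3090473/limits (the main PID shown in the systemctl output), and count_open_fds(3090473) could be compared against the soft limit. For a systemd unit, the limit is usually raised with the LimitNOFILE= directive.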
zwass

09/22/2022, 4:26 PM
How many agents do you have connected? How many Fleet servers?
wennan.he

09/22/2022, 5:48 PM
20k agents, 1 Fleet server
zwass

09/22/2022, 6:08 PM
With that many agents you would want to load balance to multiple Fleet servers.
wennan.he

09/22/2022, 6:11 PM
Well, a standalone Fleet server cannot handle 20k agents? Actually, we had a Fleet cluster with 2 servers, so each server handled only 10k agents, and it also crashed.
6:11 PM
So could you give a ratio of osquery agents to Fleet servers? How many osquery agents can one Fleet server handle?
zwass

09/22/2022, 6:15 PM
Please have a look at our reference architectures. I would expect to need about 1 CPU core and 1GB memory per 1k agents, but it will depend on many factors.
wennan.he

09/22/2022, 6:43 PM
Well, that is our CPU info:
root@n107-019-021:/var/fleet# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Stepping: 7
CPU MHz: 3599.978
BogoMIPS: 5999.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
6:44 PM
So it is an 8-core CPU, which can handle 8k agents by your estimate?
zwass

09/22/2022, 6:44 PM
I would try putting 4 of those behind a load balancer.
wennan.he

09/22/2022, 7:00 PM
OK, thank you.
7:04 PM
Also, our service has been running for over 6 months and didn't have this issue before. I would like to know why.
10:24 PM
@zwass I have another question: at 1 CPU core per 1k agents, why 4 hosts with 8 cores each to cover 20k?
zwass

09/22/2022, 10:32 PM
It's just an estimate. The resource utilization depends on a number of factors, so I suggested something that might work.
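The rule of thumb mentioned earlier in the thread (about 1 CPU core and 1 GB of memory per 1k agents) can be turned into a rough calculation. This is only a sketch of that estimate, not an official Fleet sizing formula; the function name and parameters are hypothetical.

```python
import math


def estimate_fleet_capacity(agents, cores_per_server,
                            agents_per_core=1000, gb_per_1k_agents=1.0):
    """Rough sizing from the '1 core and 1 GB per 1k agents' rule of thumb."""
    cores_needed = math.ceil(agents / agents_per_core)
    memory_gb = agents / 1000 * gb_per_1k_agents
    servers = math.ceil(cores_needed / cores_per_server)
    return {"cores": cores_needed, "memory_gb": memory_gb, "servers": servers}


# 20k agents on servers counted as 8 cores each (the "CPU(s): 8" line from lscpu).
estimate = estimate_fleet_capacity(agents=20_000, cores_per_server=8)
print(estimate)
```

Under these assumptions the estimate is 20 cores, 20 GB, and at least 3 servers, so 4 servers leaves some headroom. Note that the lscpu output reports 4 cores with 2 threads each; if only physical cores are counted (cores_per_server=4), the same rule of thumb gives 5 servers.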
wennan.he

09/22/2022, 10:50 PM
@zwass OK, that makes sense. But why do our agents send so many distributed read and write requests? Is that normal, or is there any way to debug it?
zwass

09/22/2022, 10:51 PM
Yes, that is normal. If you want to make them check in for live queries less often, you can configure distributed_interval. You could set that to 60, for example, and then you might have to wait up to a minute for a host to respond to a live query.
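As an illustration of the setting discussed here, the sketch below builds an agent options fragment with distributed_interval set to 60 and prints it as JSON. The config → options nesting is an assumption about how Fleet agent options are laid out and should be checked against the Fleet documentation for the version in use; the helper name is hypothetical.

```python
import json


def agent_options_with_interval(seconds):
    """Build an agent options fragment; the nesting shown is an assumed layout."""
    return {"config": {"options": {"distributed_interval": seconds}}}


# A 60-second interval means a live query may wait up to a minute for a host.
print(json.dumps(agent_options_with_interval(60), indent=2))
```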
10:57 PM
What is your distributed_interval setting? Each agent is probably sending a request every 10 or 60 seconds.
wennan.he

09/22/2022, 10:57 PM
Could you tell me how to check the active value of distributed_interval?
zwass

09/22/2022, 10:58 PM
If you go to the details page for a host you can see the configured value under "Agent options"
wennan.he

09/22/2022, 10:59 PM
20s means what?
zwass

09/22/2022, 10:59 PM
20 seconds
11:00 PM
That means every 20 seconds the agent will connect for live queries
wennan.he

09/22/2022, 11:00 PM
through distributed read?
zwass

09/22/2022, 11:00 PM
Yes
wennan.he

09/22/2022, 11:02 PM
If that is the case, we should see 600k distributed read requests every 10 min with 20k osquery hosts, right?
11:03 PM
20000 x 3 x 10 = 600k
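The arithmetic above (20k hosts, one check-in every 20 seconds, a 10-minute window) can be written out as a small sketch. It assumes every host checks in exactly once per interval with no latency or failures; the function name is hypothetical.

```python
def expected_requests(hosts, interval_seconds, window_seconds):
    """Ideal number of distributed read requests in a time window."""
    checkins_per_host = window_seconds // interval_seconds
    return hosts * checkins_per_host


total = expected_requests(hosts=20_000, interval_seconds=20, window_seconds=600)
per_second = total / 600

print(total)       # 600000
print(per_second)  # 1000.0
```

Put another way, the ideal case works out to about 1,000 distributed read requests per second across the fleet.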
11:03 PM
but the statistics cannot prove that.
zwass

09/22/2022, 11:08 PM
I am not sure how to explain that. Possibly there is high request latency and that reduces the total number of requests?
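One way to picture the latency explanation: if an agent only starts waiting for its next interval after the previous request has finished (an assumption about the agent's check-in loop, not something confirmed in this thread), then slow responses stretch the effective period. The sketch below uses a hypothetical 28-second latency, loosely modelled on the "took" values of roughly 25 to 30 seconds in the error log.

```python
def requests_per_host(window_seconds, interval_seconds, latency_seconds=0.0):
    """Check-ins per host when each cycle lasts interval + request latency."""
    return window_seconds / (interval_seconds + latency_seconds)


ideal = requests_per_host(600, 20)       # no latency
slowed = requests_per_host(600, 20, 28)  # assumed 28 s per request

print(ideal, slowed)                     # 30.0 12.5
print(20_000 * ideal, 20_000 * slowed)   # 600000.0 250000.0
```

Under that assumption, 20k hosts would produce about 250k requests in 10 minutes instead of 600k, which would be consistent with observed counts falling well below the ideal figure.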
wennan.he

09/22/2022, 11:09 PM
but this is not so accurate.
11:10 PM
OK, that generally answers my question. Thanks for explaining.
11:17 PM
Another question: in which cases will an agent not send distributed read requests to Fleet?
zwass

09/22/2022, 11:56 PM
It should always do it when the process is running