# fleet
w
Hi Fleet team, I want to put all my problems in a new thread. Our Fleet deployment is in a bad state: as you can see from the screenshot above, the traffic on our load balancer for Fleet is very high, and so is the Fleet server CPU usage.
All of this shows that agents are sending a lot of read and write requests to Fleet, which crashes our service, but we still don't know why.
z
How many agents do you have connected? How many Fleet servers?
w
20k agents, 1 Fleet server
z
With that many agents you would want to load balance to multiple Fleet servers.
w
Well, a standalone Fleet server cannot handle 20k agents? Actually, we previously ran a cluster of 2 Fleet servers, so each server handled only 10k agents, and it also crashed.
So could you give us a sizing ratio for Fleet and osquery? How many osquery agents can one Fleet server handle?
z
Please have a look at our reference architectures. I would expect to need about 1 CPU core and 1GB memory per 1k agents, but it will depend on many factors.
w
Well, here is our CPU info:
root@n107-019-021:/var/fleet# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Stepping: 7
CPU MHz: 3599.978
BogoMIPS: 5999.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
So it is an 8-core CPU, which can handle 8k agents, as you said?
z
I would try putting 4 of those behind a load balancer.
w
OK ty
Also, our service has been running for over 6 months and didn't have this issue before. I would like to know why.
@zwass I have another question: at 1 CPU core per 1k agents, why 4 hosts with 8 cores each to cover 20k?
z
It's just an estimate. The resource utilization depends on a number of factors, so I suggested something that might work.
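To make the estimate concrete, here is a minimal sketch of the rule of thumb quoted earlier in the thread (about 1 CPU core and 1 GB of memory per 1k agents). The function name and the `headroom` parameter are hypothetical and only for illustration; the sketch also counts the 8 CPUs reported by `lscpu` as 8 cores, although that output shows 4 cores with 2 threads each.

```python
import math

def estimate_fleet_capacity(agents, cores_per_server, headroom=0.0, agents_per_unit=1000):
    """Rough sizing from the thread's rule of thumb: 1 core and 1 GB per 1k agents.

    headroom is a hypothetical safety margin (0.5 means 50% extra capacity).
    Returns (cores_needed, memory_gb_needed, servers_needed).
    """
    units = math.ceil(agents * (1 + headroom) / agents_per_unit)
    servers = math.ceil(units / cores_per_server)
    return units, units, servers

# 20k agents on 8-CPU hosts, with no margin and with a 50% margin.
print(estimate_fleet_capacity(20_000, 8))                # (20, 20, 3)
print(estimate_fleet_capacity(20_000, 8, headroom=0.5))  # (30, 30, 4)
```

Under these assumptions the bare rule gives 3 hosts, and 4 hosts corresponds to roughly 50% headroom. The actual requirement will depend on the other factors mentioned above.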
w
@zwass OK, that makes sense. But why do our agents send so many distributed read and write requests? Is that normal, or is there a way to debug it?
z
Yes, that is normal. If you want to make them check in for live queries less often, you can configure `distributed_interval`. You could set that to `60`, for example, and then you might have to wait up to a minute for a host to respond to a live query.
What is your `distributed_interval` setting? Each agent is probably sending a request every 10 or 60 seconds.
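As a rough illustration of that trade-off, the sketch below compares the steady check-in rate for a few `distributed_interval` values. It assumes 20k hosts (the number given earlier in the thread) that check in evenly spread over time; the function name is hypothetical.

```python
def checkins_per_second(hosts, interval_s):
    """Average live-query check-ins per second if every host checks in once per interval."""
    return hosts / interval_s

HOSTS = 20_000  # agent count reported earlier in the thread

for interval_s in (10, 20, 60):
    rate = checkins_per_second(HOSTS, interval_s)
    # A longer interval lowers server load but delays live query responses.
    print(f"distributed_interval={interval_s}s: ~{rate:.0f} req/s, "
          f"up to {interval_s}s wait for a live query")
```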
w
Could you tell me how to check the active value of distributed_interval?
z
If you go to the details page for a host you can see the configured value under "Agent options"
w
What does 20s mean?
z
20 seconds
That means every 20 seconds the agent will connect for live queries
w
through distributed read?
z
Yes
w
If that is the case, we should see 600k distributed read requests every 10 minutes with 20k osquery hosts, right?
20,000 hosts x 3 requests per minute x 10 minutes = 600k
But the statistics don't show that.
z
I am not sure how to explain that. Possibly there is high request latency and that reduces the total number of requests?
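One way to picture the latency idea is the sketch below. It assumes, without confirmation in this thread, that each agent waits the full interval after a request completes before sending the next one, so the effective period is the interval plus the request latency. The function name and the 5-second latency are hypothetical.

```python
def expected_reads(hosts, interval_s, window_s, latency_s=0.0):
    """Expected distributed read requests in a time window.

    Assumption: the effective period per host is interval_s + latency_s,
    i.e. the wait starts only after the previous request finishes.
    """
    period_s = interval_s + latency_s
    return int(hosts * window_s / period_s)

# 20k hosts, 20s interval, 10-minute window.
print(expected_reads(20_000, 20, 600))                 # 600000
print(expected_reads(20_000, 20, 600, latency_s=5.0))  # 480000
```

Under that assumption, a 5-second request latency alone would drop the 10-minute total from 600k to 480k. Hosts that are offline or not running the agent would lower the count further.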
w
But that is not very accurate.
OK, that generally answers my question, ty for explaining.
Another question: in which cases would an agent not send distributed read requests to Fleet?
z
It should always do so while the process is running.