Hi Fleet team i want to put all my problems in a new thread osquery #fleet

wennan.he

09/22/2022, 4:22 AM

Hi Fleet team, I want to put all my problems in a new thread. Our Fleet is running in a bad state: as you can see from the above screenshot, the traffic on our LB for Fleet is pretty high, and so is the Fleet server CPU usage.

wennan.he

09/22/2022, 4:25 AM

All of this shows that agents send a lot of read and write requests to Fleet, which crashes our service. But we still don't know why.

zwass

09/22/2022, 4:26 PM

How many agents do you have connected? How many Fleet servers?

wennan.he

09/22/2022, 5:48 PM

20k agents, 1 Fleet server

zwass

09/22/2022, 6:08 PM

With that many agents you would want to load balance to multiple Fleet servers.

zwass

09/22/2022, 6:09 PM

See https://fleetdm.com/docs/deploying/introduction#infrastructure-dependencies and https://fleetdm.com/docs/deploying/reference-architectures

wennan.he

09/22/2022, 6:11 PM

Well, a standalone Fleet cannot handle 20k agents? Actually, we had a Fleet cluster with 2 Fleet servers, so each server handled only 10k agents, and it also crashed.

wennan.he

09/22/2022, 6:11 PM

So could you offer a ratio of Fleet servers to osquery agents? How many osquery agents can one Fleet server handle?

zwass

09/22/2022, 6:15 PM

Please have a look at our reference architectures. I would expect to need about 1 CPU core and 1GB memory per 1k agents, but it will depend on many factors.

wennan.he

09/22/2022, 6:43 PM

Well, that is our CPU info:

root@n107-019-021:/var/fleet# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Stepping:            7
CPU MHz:             3599.978
BogoMIPS:            5999.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K

wennan.he

09/22/2022, 6:44 PM

So it is an 8-core CPU, which can handle 8k agents as you said?

zwass

09/22/2022, 6:44 PM

I would try putting 4 of those behind a load balancer.

wennan.he

09/22/2022, 7:00 PM

OK ty

wennan.he

09/22/2022, 7:04 PM

Also, our service has been running for over 6 months and didn't have this issue before. I would like to know why.

wennan.he

09/22/2022, 10:24 PM

@zwass I have another question: at 1 CPU core per 1k agents, why 4 hosts with 8 cores each to cover 20k?

zwass

09/22/2022, 10:32 PM

It's just an estimate. The resource utilization depends on a number of factors, so I suggested something that might work.
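To make the sizing estimate concrete, here is a minimal sketch of the arithmetic behind the "1 CPU core and 1GB memory per 1k agents" rule of thumb from this thread. The function name and the `headroom` factor are hypothetical, added only for illustration. The `lscpu` output above reports 8 logical CPUs but 4 physical cores (2 threads per core), so both readings are shown.

```python
import math

# Rule of thumb quoted in the thread; an estimate, not a guarantee.
CORES_PER_1K_AGENTS = 1


def estimate_hosts(agents: int, cores_per_host: int, headroom: float = 1.0) -> int:
    """Return how many hosts cover `agents` under the rule of thumb.

    `headroom` is a hypothetical safety multiplier (1.0 means none).
    """
    cores_needed = agents / 1000 * CORES_PER_1K_AGENTS * headroom
    return math.ceil(cores_needed / cores_per_host)


if __name__ == "__main__":
    # Counting the 8 logical CPUs reported by lscpu as cores.
    print(estimate_hosts(20_000, cores_per_host=8))
    # Counting only the 4 physical cores.
    print(estimate_hosts(20_000, cores_per_host=4))
    # 8 logical CPUs with an assumed 50% headroom.
    print(estimate_hosts(20_000, cores_per_host=8, headroom=1.5))
```

Depending on how cores are counted and how much headroom is assumed, the result lands between 3 and 5 hosts, which brackets the suggestion of 4.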

wennan.he

09/22/2022, 10:50 PM

@zwass OK, that makes sense. But why do our agents send so many requests for distributed read and write? Is that normal, or is there any way to debug it?

zwass

09/22/2022, 10:51 PM

Yes, that is normal. If you want to make them check in for live queries less often, you can configure distributed_interval. You could set that to 60 for example and then you might have to wait up to a minute for a host to respond to a live query.
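As a rough illustration of that setting, the sketch below builds an agent options fragment with `distributed_interval` raised. The nesting under `config` and `options` is an assumption about how Fleet lays out agent options, and the value 60 is chosen only to match the "up to a minute" wait mentioned above. Check both against the agent options shown for your own hosts before applying anything.

```python
import json


def agent_options_with_interval(seconds: int) -> dict:
    """Build a hypothetical agent options fragment.

    The config -> options nesting is assumed, not confirmed in this thread.
    """
    return {"config": {"options": {"distributed_interval": seconds}}}


if __name__ == "__main__":
    # 60 seconds: hosts check in for live queries once per minute.
    print(json.dumps(agent_options_with_interval(60), indent=2))
```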

zwass

09/22/2022, 10:57 PM

What is your distributed_interval setting? Each agent is probably sending a request every 10 or 60 seconds.

wennan.he

09/22/2022, 10:57 PM

Could you tell me how to check the active value of distributed_interval?

zwass

09/22/2022, 10:58 PM

If you go to the details page for a host you can see the configured value under "Agent options"

wennan.he

09/22/2022, 10:59 PM

20s means what?

zwass

09/22/2022, 10:59 PM

20 seconds

zwass

09/22/2022, 11:00 PM

That means every 20 seconds the agent will connect for live queries

wennan.he

09/22/2022, 11:00 PM

through distributed read?

zwass

09/22/2022, 11:00 PM

Yes

wennan.he

09/22/2022, 11:02 PM

If that is the case, we should see 600k requests for distributed read alone every 10 min with 20k osquery hosts, right?

wennan.he

09/22/2022, 11:03 PM

20000 x 3 x 10 = 600k

wennan.he

09/22/2022, 11:03 PM

But the statistics do not bear that out.

zwass

09/22/2022, 11:08 PM

I am not sure how to explain that. Possibly there is high request latency and that reduces the total number of requests?
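The 600k figure and the latency hypothesis can be put side by side in a small sketch. It assumes each agent waits the full `distributed_interval` after a request completes, so one cycle takes interval plus latency. That scheduling model is an assumption, not something confirmed in this thread, and the 5-second latency is a made-up value for illustration.

```python
def expected_reads(hosts: int, interval_s: float, window_s: float,
                   latency_s: float = 0.0) -> int:
    """Estimate distributed read requests from all hosts within a window.

    Assumes one request per (interval + latency) cycle per host.
    """
    cycle_s = interval_s + latency_s
    return int(hosts * (window_s / cycle_s))


if __name__ == "__main__":
    # 20k hosts, 20 s interval, 10 min window, no latency: 20000 x 3 x 10.
    print(expected_reads(20_000, interval_s=20, window_s=600))
    # Same setup with a hypothetical 5 s of request latency per cycle.
    print(expected_reads(20_000, interval_s=20, window_s=600, latency_s=5))
```

Under this model, any added latency stretches each cycle, so the observed count would fall below 600k even with every agent healthy.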

wennan.he

09/22/2022, 11:09 PM

but this is not so accurate.

wennan.he

09/22/2022, 11:10 PM

OK, that generally answers my question. Thanks for explaining.

wennan.he

09/22/2022, 11:17 PM

Another question: could you tell me in which cases an agent will not send distributed read requests to Fleet?

zwass

09/22/2022, 11:56 PM

It should always do it when the process is running
