# fleet
o
Hi everyone 😁, I'm having a weird issue with osquery-Fleet communication. I'm using Fleet 4.60.1 and osquery 5.13.1. I know that osquery sends a POST request to `/api/v1/osquery/distributed/read` to check in for distributed queries, and sends the results as a POST request to `/api/v1/osquery/distributed/write`. I'm running my agent with `--verbose` and `--tls_dump` to watch the communication. I see a 'read' request and the received query to execute, then I see the 'write' request with the query's final results, but the weird part is that the next time the agent sends a 'read' request it gets the same query back, as if it never returned its results (sometimes this happens more than twice). Can someone please help? I have no clue what's wrong, and I didn't change any configuration (it just started happening today) 🙏
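For reference, here's roughly how I'm launching the agent (hostname and secret path are placeholders, not my real setup):

```
# Placeholder Fleet server and enroll secret path
osqueryd \
  --tls_hostname=fleet.example.com \
  --enroll_secret_path=/etc/osquery/secret \
  --enroll_tls_endpoint=/api/v1/osquery/enroll \
  --config_plugin=tls \
  --config_tls_endpoint=/api/v1/osquery/config \
  --distributed_plugin=tls \
  --distributed_tls_read_endpoint=/api/v1/osquery/distributed/read \
  --distributed_tls_write_endpoint=/api/v1/osquery/distributed/write \
  --verbose \
  --tls_dump
```

`--tls_dump` prints the raw request/response bodies to stderr, which is how I'm seeing the read/write traffic.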
k
Hi @Ortal Kombat! Are you seeing any errors in the Fleet server logs around query ingestion? Is this happening for all queries, or a particular query/type of query (detail queries/policies/live queries)?
o
Hi @Kathy Satterlee! I think I figured out why this happens. The `distributed_interval` was set to 10. When the agent got a query, it executed it right away and sent the results to `/api/v1/osquery/distributed/write`, but my Fleet server only processed them more than 10 seconds later, so the agent got the same query again the next time it polled `/api/v1/osquery/distributed/read`. I have a lot of agents (around 8,000), and I think Fleet ingested the results too late because of high load. Is there any way to stop some functions and queries from running regularly? For example, I saw that when the agent reports all the software on the machine it sends a HUGE JSON to Fleet (and from 8,000 hosts that probably causes a lot of traffic). I also increased the `distributed_interval` to 120, but the problem still occurs sometimes. Thanks!
u
If a host has already picked up a distributed query once, it shouldn't pick it up again even if it checks in before sending results. It sounds like there may be some slowness between Fleet and Redis/MySQL; in the Fleet server logs, that often manifests as "context cancelled" or "i/o timeout" errors. I'd definitely take a look at the reference architecture to make sure you're properly scaled for the number of hosts you're enrolling. There are a few other things you can do to spread out the load a bit. For instance, you could stretch out the intervals at which host data are collected.
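As a sketch (the exact interval values here are just examples, tune them for your environment), you can stretch those intervals through Fleet's agent options:

```
apiVersion: v1
kind: config
spec:
  agent_options:
    config:
      options:
        # how often hosts poll for distributed queries (seconds)
        distributed_interval: 120
        # how often hosts ship scheduled query results
        logger_tls_period: 60
        # how often hosts check for an updated config
        config_tls_refresh: 300
```

Save that as a YAML file and apply it with `fleetctl apply -f agent-options.yml`.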
o
@Kathy Satterlee Thanks, I followed your suggestions and it all runs smoother now. Looks like adjusting the intervals to spread the load really helped. 🙏
k
Awesome!