# kolide
b
👋 Would anyone have some recommendations on where to look further on this?
```
2020-09-15T19:30:47.771465407Z {"component":"service","err":"failed to ingest result: campaign waiting for listener","ip_addr":"x","level":"debug","method":"SubmitDistributedQueryResults","took":"8.342794ms","ts":"2020-09-15T19:30:47.771281978Z","x_for_ip_addr":"x"}
```
I've already got Kolide in debug mode but I'm not seeing anything aside from the failure
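A quick way to gauge how often that error is firing (the log source below is an assumption; substitute your actual Fleet pod name or log file):
```sh
# count "campaign waiting for listener" errors in the last 10 minutes of Fleet logs
# (assumes Fleet runs as a Kubernetes deployment named "fleet"; adjust to your setup)
kubectl logs deploy/fleet --since=10m | grep -c 'campaign waiting for listener'
```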
z
do you have `--tls_dump` enabled? it might tell you what API endpoint is failing
do you have both distributed read/write endpoints configured?
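For reference, a minimal sketch of the osqueryd flags involved (hostname and endpoint paths are placeholders; check the Fleet docs for the exact paths your version expects):
```sh
osqueryd \
  --tls_hostname=fleet.example.com \
  --distributed_plugin=tls \
  --distributed_tls_read_endpoint=/api/v1/distributed/read \
  --distributed_tls_write_endpoint=/api/v1/distributed/write \
  --distributed_interval=30 \
  --tls_dump=true  # dumps raw TLS request/response bodies; very noisy, debug only
```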
z
Are you getting any live query results? It's not unreasonable to see a few of these errors if you have a very high frequency of hosts checking in. Under normal circumstances the host will just try to submit results again in a few seconds and by then the listener will be attached.
b
enabling `--tls_dump` now.
zwass - it seems like they're completely broken
c
We have never gotten live queries to work reliably. This started when we moved to a 'proper' Redis database managed by another team. Before that we had one Redis instance per cluster in a multi-cluster Kubernetes deployment.
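(A quick sanity check that the shared Redis supports pub/sub across clients, which is what the live query listener relies on as far as I understand; the hostname and channel name below are arbitrary placeholders:)
```sh
# terminal 1: subscribe to a throwaway channel on the shared Redis
redis-cli -h redis.example.com SUBSCRIBE fleet-livequery-test

# terminal 2: publish to the same channel; a reply of 1 means the subscriber was reached
redis-cli -h redis.example.com PUBLISH fleet-livequery-test hello
```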
It was the distributed write endpoint that was failing when I was debugging this weekend.
We have 6500 online hosts and see about 460 fails every 10 seconds.
`distributed_interval` is 30 seconds and `distributed_tls_max_attempts` is 3
Specifically, our load balancer is seeing 500 errors from `/api/v1/distributed/write`, but we aren't running any live queries at all (unless there are some trapped in limbo or something?)
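(Rough rate math from the numbers above, just to put the failure count in context:)
```sh
# 6500 hosts checking in every 30 s is roughly 217 distributed writes per second,
# i.e. ~2170 per 10 s window, so ~460 failures per 10 s is roughly a 20% failure rate
echo "scale=3; 6500/30" | bc                 # ~216.7 writes/s
echo "scale=3; 460/(6500/30*10)*100" | bc    # ~21% of writes failing
```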
z
Which version of fleet are each of you running?
That scale should be easily supported by Fleet 3.x
Fleet does run live queries to update the host "details" even if you are not running any manually.
b
we upgraded to 3.1 last week
c
I found out the 'fun' way that's how Fleet knows that hosts are online when I turned the setting off to try and stop the errors 😆
Should the `distributed_query_campaigns` table have entries that stay in it? (And have `deleted` of `0`?)
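(A quick way to check, assuming direct access to Fleet's MySQL database; the database name and credentials are placeholders:)
```sh
# count campaigns still marked as not deleted
mysql -u fleet -p fleet \
  -e 'SELECT COUNT(*) FROM distributed_query_campaigns WHERE deleted = 0;'
```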
We were able to resolve this by removing the live query entries in Redis, removing the `distributed_query_campaigns` rows in the database, and restarting osqueryd.
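(For anyone who finds this later, a rough sketch of that cleanup; the Redis key pattern, database name, and service name are all assumptions, so inspect and back up before deleting anything:)
```sh
# remove lingering live query state from Redis (key pattern is a guess -- verify with --scan first)
redis-cli -h redis.example.com --scan --pattern '*livequery*' \
  | xargs -r redis-cli -h redis.example.com del

# clear the stuck campaigns out of MySQL
mysql -u fleet -p fleet -e 'DELETE FROM distributed_query_campaigns WHERE deleted = 0;'

# restart the osquery agents (service name depends on how osquery is installed)
sudo systemctl restart osqueryd
```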
However, in our production cluster any live query fails and triggers this. I'm going to write up an issue for that.