#kolide
beatus

09/15/2020, 7:35 PM
👋 Would anyone have some recommendations on where to look further on this?
2020-09-15T19:30:47.771465407Z {"component":"service","err":"failed to ingest result: campaign waiting for listener","ip_addr":"x","level":"debug","method":"SubmitDistributedQueryResults","took":"8.342794ms","ts":"2020-09-15T19:30:47.771281978Z","x_for_ip_addr":"x"}
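A quick way to see whether this is the only failure, or just the loudest one, is to tally the err field per method across the container logs. The following is a minimal sketch, assuming every log line has the same shape as the one above (a timestamp prefix, a space, then a JSON object); the helper names parse_line and count_errors are made up for illustration and are not part of Fleet.

```python
import json
from collections import Counter


def parse_line(line):
    """Split a '<timestamp> <json>' log line into (timestamp, record)."""
    prefix, _, payload = line.partition(" ")
    return prefix, json.loads(payload)


def count_errors(lines):
    """Count (method, err) pairs over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            _, record = parse_line(line)
        except json.JSONDecodeError:
            # Skip lines that do not carry a JSON payload.
            continue
        if isinstance(record, dict) and "err" in record:
            counts[(record.get("method"), record["err"])] += 1
    return counts


# The log line quoted in the message above, used as sample input.
sample = (
    '2020-09-15T19:30:47.771465407Z {"component":"service",'
    '"err":"failed to ingest result: campaign waiting for listener",'
    '"ip_addr":"x","level":"debug","method":"SubmitDistributedQueryResults",'
    '"took":"8.342794ms","ts":"2020-09-15T19:30:47.771281978Z",'
    '"x_for_ip_addr":"x"}'
)

counts = count_errors([sample])
for (method, err), n in counts.most_common():
    print(n, method, err)
```

Feeding it the full log (for example, by reading lines from a file) would show whether other methods are failing alongside SubmitDistributedQueryResults.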
I've already got Kolide in debug mode, but I'm not seeing anything aside from the failure.
Zach Zeid

09/15/2020, 9:05 PM
do you have --tls_dump enabled? it might tell you what api endpoint is failing
9:05 PM
do you have both distributed read/write endpoints configured?
zwass

09/16/2020, 4:47 PM
Are you getting any live query results? It's not unreasonable to see a few of these errors if you have a very high frequency of hosts checking in. Under normal circumstances the host will just try to submit results again in a few seconds and by then the listener will be attached.
beatus

09/16/2020, 7:38 PM
--tls_dump enabling this now.
7:39 PM
zwass - it seems like they're completely broken
crimsonknave

09/16/2020, 7:42 PM
We have never gotten live queries to work reliably. This started when we moved to a 'proper' redis database managed by another team. Before, we had one per cluster in a multicluster kubernetes deployment.
7:44 PM
It was the distributed write endpoint that was failing when I was debugging this weekend.
7:47 PM
We have 6500 online hosts and see about 460 fails every 10 seconds. distributed_interval is 30 seconds and distributed_tls_max_attempts is 3
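To put those numbers in proportion, here is a rough back-of-the-envelope sketch. It assumes each online host makes one distributed request per distributed_interval and that check-ins are spread evenly over time; both are simplifications, since a host only writes when it has results to submit, so the real failure share of write requests may be higher. The function name failure_share is hypothetical.

```python
def failure_share(hosts, interval_s, failures, window_s):
    """Estimate requests per window and the fraction of them that fail.

    Assumption: every host makes exactly one request per interval,
    evenly spread over time.
    """
    requests_in_window = hosts * window_s / interval_s
    return requests_in_window, failures / requests_in_window


# Figures quoted in the message above.
requests, share = failure_share(hosts=6500, interval_s=30, failures=460, window_s=10)
print(f"~{requests:.0f} check-ins per 10 s, ~{share:.0%} of them failing")
```

Under those assumptions, the failures amount to roughly one in five check-ins, which is well beyond the "few errors" expected from hosts that arrive before a listener is attached.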
7:56 PM
Specifically, our load balancer is seeing 500 errors from /api/v1/distributed/write, but we aren't running any live queries at all (unless there are some trapped in limbo or something?)
zwass

09/16/2020, 8:04 PM
Which version of Fleet is each of you running?
8:04 PM
That scale should be easily supported by Fleet 3.x
8:04 PM
Fleet does run live queries to update the host "details" even if you are not running any manually.
beatus

09/16/2020, 8:13 PM
we upgraded to 3.1 last week
crimsonknave

09/16/2020, 8:38 PM
I found out the 'fun' way that's how Fleet knows that hosts are online when I turned the setting off to try and stop the errors 😆
8:39 PM
Should the distributed_query_campaigns table have entries that stay in it? (And have deleted of 0?)
3:20 PM
We were able to resolve this by removing the live query entries in redis and the distributed_query_campaigns entries in the database, then restarting osqueryd.
3:20 PM
However, in our production cluster any live query fails and triggers this. I'm going to write up an issue for that.