#kolide

beatus

09/15/2020, 7:35 PM
👋 Would anyone have some recommendations on where to look further on this?
2020-09-15T19:30:47.771465407Z {"component":"service","err":"failed to ingest result: campaign waiting for listener","ip_addr":"x","level":"debug","method":"SubmitDistributedQueryResults","took":"8.342794ms","ts":"2020-09-15T19:30:47.771281978Z","x_for_ip_addr":"x"}
I've already got Kolide in debug mode, but I'm not seeing anything aside from the failure.
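
When only one error line stands out, it can help to tally errors by method over a larger slice of the log to see whether `SubmitDistributedQueryResults` is the only failing call. The following is a minimal sketch, assuming every log line has the shape shown above (a timestamp followed by one JSON object); `count_errors` is a hypothetical helper, not part of Fleet or osquery.

```python
import json
from collections import Counter


def count_errors(lines):
    """Tally (method, err) pairs from log lines shaped like the sample above.

    Assumption: each line is '<timestamp> <json object>'. Lines that do not
    contain a parseable JSON object are skipped.
    """
    counts = Counter()
    for line in lines:
        start = line.find("{")
        if start == -1:
            continue
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        if isinstance(entry, dict) and "err" in entry:
            counts[(entry.get("method"), entry["err"])] += 1
    return counts


# The log line quoted in the message above, unchanged.
sample = (
    '2020-09-15T19:30:47.771465407Z {"component":"service",'
    '"err":"failed to ingest result: campaign waiting for listener",'
    '"ip_addr":"x","level":"debug","method":"SubmitDistributedQueryResults",'
    '"took":"8.342794ms","ts":"2020-09-15T19:30:47.771281978Z",'
    '"x_for_ip_addr":"x"}'
)

if __name__ == "__main__":
    for (method, err), n in count_errors([sample]).most_common():
        print(n, method, err)
```

In practice the input would be the lines of the Fleet server log, read from a file or a pipe, rather than the single sample.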

Zach Zeid

09/15/2020, 9:05 PM
Do you have `--tls_dump` enabled? It might tell you which API endpoint is failing.
Do you have both distributed read/write endpoints configured?
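
One way to answer the second question is to check the osquery flagfile for the distributed flags. The sketch below assumes a plain `--name=value` flagfile; the flag names are standard osquery flags, but the endpoint path in the example is an assumption taken from Fleet's usual setup, and `missing_distributed_flags` is a hypothetical helper.

```python
# Flags osquery needs in order to read and write distributed queries over TLS.
REQUIRED_FLAGS = (
    "distributed_plugin",
    "distributed_tls_read_endpoint",
    "distributed_tls_write_endpoint",
)


def parse_flagfile(text):
    """Parse '--name=value' lines into a dict; other lines are ignored."""
    flags = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("--"):
            continue  # blank lines, comments, anything that is not a flag
        name, _, value = line[2:].partition("=")
        flags[name] = value
    return flags


def missing_distributed_flags(text):
    """Return the required distributed flags that the flagfile does not set."""
    flags = parse_flagfile(text)
    return [name for name in REQUIRED_FLAGS if name not in flags]


# Example flagfile (assumed values); the write endpoint is left out on purpose.
example = """\
--tls_dump=true
--disable_distributed=false
--distributed_plugin=tls
--distributed_interval=30
--distributed_tls_max_attempts=3
--distributed_tls_read_endpoint=/api/v1/osquery/distributed/read
"""

if __name__ == "__main__":
    print(missing_distributed_flags(example))
```

For the example above the helper reports `distributed_tls_write_endpoint` as missing.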

zwass

09/16/2020, 4:47 PM
Are you getting any live query results? It's not unreasonable to see a few of these errors if you have a very high frequency of hosts checking in. Under normal circumstances the host will just try to submit results again in a few seconds and by then the listener will be attached.

beatus

09/16/2020, 7:38 PM
Enabling `--tls_dump` now.
zwass, it seems like live queries are completely broken.

crimsonknave

09/16/2020, 7:42 PM
We have never gotten live queries to work reliably. This started when we moved to a 'proper' Redis database managed by another team. Before that, we had one per cluster in a multicluster Kubernetes deployment.
It was the distributed write endpoint that was failing when I was debugging this weekend.
We have 6500 online hosts and see about 460 fails every 10 seconds.
`distributed_interval` is 30 seconds and `distributed_tls_max_attempts` is 3.
Specifically, our load balancer is seeing 500 errors from `/api/v1/distributed/write`, but we aren't running any live queries at all (unless there are some trapped in limbo or something?)
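
For a sense of scale, the numbers in this message can be turned into a rough failure share. This is a back-of-the-envelope sketch, assuming host check-ins are spread evenly over the interval and that a host submits at most one write per check-in; retries (up to 3 attempts here) may inflate the failure count, so the result is only indicative.

```python
def write_failure_share(hosts, distributed_interval_s, failures, window_s):
    """Estimate check-ins per second and failures per check-in.

    Assumes check-ins are evenly spread across the distributed interval.
    """
    checkins_per_s = hosts / distributed_interval_s
    checkins_in_window = checkins_per_s * window_s
    return checkins_per_s, failures / checkins_in_window


# Figures from the message: 6500 hosts, 30 s interval, 460 failures per 10 s.
rate, share = write_failure_share(6500, 30, 460, 10)

if __name__ == "__main__":
    print(f"{rate:.0f} check-ins/s, failures on about {share:.0%} of check-ins")
```

With these figures that is roughly 217 check-ins per second and failures on about 21% of check-ins, well beyond the occasional error that would be expected from a listener attaching late.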

zwass

09/16/2020, 8:04 PM
Which version of fleet are each of you running?
That scale should be easily supported by Fleet 3.x
Fleet does run live queries to update the host "details" even if you are not running any manually.

beatus

09/16/2020, 8:13 PM
we upgraded to 3.1 last week

crimsonknave

09/16/2020, 8:38 PM
I found out the 'fun' way that's how Fleet knows that hosts are online when I turned the setting off to try and stop the errors 😆
Should the `distributed_query_campaigns` table have entries that stay in it? (And have `deleted` of `0`?)
We were able to resolve this by removing the live query entries in Redis, deleting the `distributed_query_campaigns` rows in the database, and restarting osqueryd.
However, in our production cluster any live query fails and triggers this. I'm going to write up an issue for that.
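
Before deleting anything, it may be worth listing which campaign rows are still marked as not deleted. The sketch below is read-only and uses an in-memory SQLite table as a stand-in for Fleet's MySQL database; only the table name and the `deleted` column come from the thread, and the `id` column and the sample rows are assumptions for illustration.

```python
import sqlite3


def lingering_campaigns(conn):
    """Return ids of campaign rows that are still marked deleted = 0."""
    cur = conn.execute(
        "SELECT id FROM distributed_query_campaigns "
        "WHERE deleted = 0 ORDER BY id"
    )
    return [row[0] for row in cur.fetchall()]


# Stand-in database with made-up rows; a real check would connect to the
# Fleet MySQL database instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE distributed_query_campaigns ("
    "id INTEGER PRIMARY KEY, deleted INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO distributed_query_campaigns (id, deleted) VALUES (?, ?)",
    [(1, 0), (2, 1), (3, 0)],
)

if __name__ == "__main__":
    print(lingering_campaigns(conn))
```

The same `SELECT` can be run directly in a MySQL client against the Fleet database.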