# kolide
b
👋 Would anyone have some recommendations on where to look further on this?
```
2020-09-15T19:30:47.771465407Z {"component":"service","err":"failed to ingest result: campaign waiting for listener","ip_addr":"x","level":"debug","method":"SubmitDistributedQueryResults","took":"8.342794ms","ts":"2020-09-15T19:30:47.771281978Z","x_for_ip_addr":"x"}
```
I've already got Kolide in debug mode but I'm not seeing anything aside from the failure
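A quick way to gauge how often that error is firing (the log source below is an assumption; substitute your actual Fleet pod name or log file):
```sh
# count "campaign waiting for listener" errors in the last 10 minutes of Fleet logs
# (assumes Fleet runs as a Kubernetes deployment named "fleet"; adjust to your setup)
kubectl logs deploy/fleet --since=10m | grep -c 'campaign waiting for listener'
```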
z
do you have `--tls_dump` enabled? it might tell you what API endpoint is failing
do you have both distributed read/write endpoints configured?
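For reference, a minimal sketch of the osqueryd flags involved (hostname and endpoint paths are placeholders; check the Fleet docs for the exact paths your version expects):
```sh
osqueryd \
  --tls_hostname=fleet.example.com \
  --distributed_plugin=tls \
  --distributed_tls_read_endpoint=/api/v1/distributed/read \
  --distributed_tls_write_endpoint=/api/v1/distributed/write \
  --distributed_interval=30 \
  --tls_dump=true  # dumps raw TLS request/response bodies; very noisy, debug only
```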
z
Are you getting any live query results? It's not unreasonable to see a few of these errors if you have a very high frequency of hosts checking in. Under normal circumstances the host will just try to submit results again in a few seconds and by then the listener will be attached.
b
enabling `--tls_dump` now.
zwass - it seems like they're completely broken
c
We have never gotten live queries to work reliably. This started when we moved to a 'proper' Redis database managed by another team. Before that we had one Redis instance per cluster in a multi-cluster Kubernetes deployment.
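(A quick sanity check that the shared Redis supports pub/sub across clients, which is what the live query listener relies on as far as I understand; the hostname and channel name below are arbitrary placeholders:)
```sh
# terminal 1: subscribe to a throwaway channel on the shared Redis
redis-cli -h redis.example.com SUBSCRIBE fleet-livequery-test

# terminal 2: publish to the same channel; a reply of 1 means the subscriber was reached
redis-cli -h redis.example.com PUBLISH fleet-livequery-test hello
```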
It was the distributed write endpoint that was failing when I was debugging this weekend.
We have 6500 online hosts and see about 460 fails every 10 seconds.
`distributed_interval` is 30 seconds and `distributed_tls_max_attempts` is 3
Specifically, our load balancer is seeing 500 errors from `/api/v1/distributed/write`, but we aren't running any live queries at all (unless there are some trapped in limbo or something?)
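(Rough rate math from the numbers above, just to put the failure count in context:)
```sh
# 6500 hosts checking in every 30 s is roughly 217 distributed writes per second,
# i.e. ~2170 per 10 s window, so ~460 failures per 10 s is roughly a 20% failure rate
echo "scale=3; 6500/30" | bc                 # ~216.7 writes/s
echo "scale=3; 460/(6500/30*10)*100" | bc    # ~21% of writes failing
```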
z
Which version of fleet are each of you running?
That scale should be easily supported by Fleet 3.x
Fleet does run live queries to update the host "details" even if you are not running any manually.
b
we upgraded to 3.1 last week
c
I found out the 'fun' way that's how Fleet knows that hosts are online when I turned the setting off to try and stop the errors 😆
Should the `distributed_query_campaigns` table have entries that stay in it? (And have `deleted` of `0`?)
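(A quick way to check, assuming direct access to Fleet's MySQL database; the database name and credentials are placeholders:)
```sh
# count campaigns still marked as not deleted
mysql -u fleet -p fleet \
  -e 'SELECT COUNT(*) FROM distributed_query_campaigns WHERE deleted = 0;'
```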
We were able to resolve this by removing the live query entries in Redis, removing the `distributed_query_campaigns` rows in the database, and restarting osqueryd.
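(For anyone who finds this later, a rough sketch of that cleanup; the Redis key pattern, database name, and service name are all assumptions, so inspect and back up before deleting anything:)
```sh
# remove lingering live query state from Redis (key pattern is a guess -- verify with --scan first)
redis-cli -h redis.example.com --scan --pattern '*livequery*' \
  | xargs -r redis-cli -h redis.example.com del

# clear the stuck campaigns out of MySQL
mysql -u fleet -p fleet -e 'DELETE FROM distributed_query_campaigns WHERE deleted = 0;'

# restart the osquery agents (service name depends on how osquery is installed)
sudo systemctl restart osqueryd
```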
However, in our production cluster any live query fails and triggers this. I'm going to write up an issue for that.