# fleet
r
Fleet seems to be logging a lot of lines like this during startup, does anyone know what they mean?
level=error ts=2022-02-08T18:18:27.469599841Z op=QueriesForHost err="load active queries: EOF"
t
that part reaches out to redis to check for live queries that might be running
is redis running properly?
r
ah right, so EOF could mean a problem connecting to Redis?
I have suspicions that it isn’t able to connect properly
thanks for the pointer 👍
t
yup, it connects to redis and runs a SMEMBERS, if at some point something just chops off the connection, that's the error that could happen
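To make that failure mode concrete, here is a minimal sketch of that kind of check using the redigo Redis client for Go. The address and the "active_queries" key name are placeholders for illustration, not Fleet's actual values; the point is where a dropped connection surfaces as the EOF seen in the log above.
```go
package main

import (
	"log"
	"time"

	"github.com/gomodule/redigo/redis"
)

func main() {
	// Placeholder address; in a real setup this comes from the Redis config.
	conn, err := redis.Dial("tcp", "127.0.0.1:6379",
		redis.DialConnectTimeout(5*time.Second),
	)
	if err != nil {
		log.Fatalf("dial redis: %v", err)
	}
	defer conn.Close()

	// Ask Redis for the set of currently active live queries. If the server
	// (or anything in between) drops the connection mid-command, the error
	// comes back as io.EOF, which is the kind of thing that gets wrapped as
	// "load active queries: EOF".
	members, err := redis.Strings(conn.Do("SMEMBERS", "active_queries"))
	if err != nil {
		log.Fatalf("load active queries: %v", err)
	}
	log.Printf("active queries: %v", members)
}
```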
r
right oh
i’ll double check the setup
Still debugging, but I may have to continue tomorrow.
Incidentally, we’re seeing this logged a lot when executing distributed queries:
method=POST uri=/api/v1/osquery/distributed/write took=1.86989925s ip_addr=<ip>:45002 x_for_ip_addr= ingestion-err="campaign waiting for listener (please retry)" err="timestamp: 2022-02-08T18:57:28Z: error in query ingestion"
I’ll pick this up again tomorrow.
t
certainly looks like something around redis misbehaving
r
Hi @Tomas Touceda, I’m still trying to debug what’s going on here - the Redis instance itself seems fine, but we’re still seeing all sorts of weird behaviour in Fleet as a result. Is there a quick way to verify the connection from Fleet to Redis is ok?
I’m going to verify it using tlsconnect.go
2022/02/09 10:50:12 pool created successfully
2022/02/09 10:50:13 command result: NOAUTH Authentication required. ; NOAUTH Authentication required.
Looks good! The AUTH value is set in the Fleet.yml, doesn’t seem to be possible to test that with the tlsconnect.go tool though.
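If it helps, a minimal check that also exercises AUTH (which the tlsconnect.go tool apparently does not) could look roughly like the redigo sketch below. The address, port, server name, password, and CA file path are all placeholders for the Memorystore instance, not values from this thread.
```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"os"
	"time"

	"github.com/gomodule/redigo/redis"
)

func main() {
	const (
		addr       = "10.0.0.3:6378"    // placeholder instance IP and TLS port
		serverName = "10.0.0.3"         // name the served certificate is expected to match
		password   = "your-auth-string" // placeholder AUTH value
	)

	// Trust the server CA downloaded from the Memorystore instance explicitly,
	// since it is not in the system roots.
	caPEM, err := os.ReadFile("server-ca.pem")
	if err != nil {
		log.Fatalf("read CA: %v", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		log.Fatal("could not parse server CA PEM")
	}

	conn, err := redis.Dial("tcp", addr,
		redis.DialConnectTimeout(5*time.Second),
		redis.DialUseTLS(true),
		redis.DialTLSConfig(&tls.Config{ServerName: serverName, RootCAs: pool}),
		// DialPassword makes redigo send AUTH before any other command, so a
		// successful PING below also confirms the AUTH value is accepted.
		redis.DialPassword(password),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	pong, err := redis.String(conn.Do("PING"))
	if err != nil {
		log.Fatalf("ping: %v", err)
	}
	log.Printf("reply: %s", pong) // expect "PONG" if both TLS and AUTH are OK
}
```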
ingestion-err="writing results: PUBLISH failed to channel results_49: EOF" err="timestamp: 2022-02-09T10:57:52Z: error in query ingestion"
the weird thing is - Fleet is logging all of these errors, but Live Queries do work, as do scheduled query packs, so I have no idea what the issue is 🤔
this is using a GCP Memorystore instance with TLS and AUTH enabled in case anyone can reproduce this.
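For anyone following along, both errors above appear to come out of the Redis publish/subscribe path used for live query results: results are PUBLISHed to a per-campaign channel (results_49 in the log) and a listener SUBSCRIBEs to it. Below is a rough sketch of that pattern in redigo, not Fleet's actual code; the address is a placeholder and the channel name is taken from the log.
```go
package main

import (
	"log"
	"time"

	"github.com/gomodule/redigo/redis"
)

func main() {
	const addr = "127.0.0.1:6379" // placeholder
	const channel = "results_49"  // per-campaign channel, as seen in the log

	// Listener side: subscribe to the campaign's results channel.
	subConn, err := redis.Dial("tcp", addr, redis.DialConnectTimeout(5*time.Second))
	if err != nil {
		log.Fatalf("dial (subscribe): %v", err)
	}
	defer subConn.Close()
	psc := redis.PubSubConn{Conn: subConn}
	if err := psc.Subscribe(channel); err != nil {
		log.Fatalf("subscribe: %v", err)
	}
	// Wait for the subscribe confirmation so the publish below is not lost.
	if _, ok := psc.Receive().(redis.Subscription); !ok {
		log.Fatal("expected subscribe confirmation")
	}

	// Ingestion side: publish a (fake) result row for the campaign. If the
	// connection is cut mid-command, the error surfaces here, much like
	// "PUBLISH failed to channel results_49: EOF".
	pubConn, err := redis.Dial("tcp", addr, redis.DialConnectTimeout(5*time.Second))
	if err != nil {
		log.Fatalf("dial (publish): %v", err)
	}
	defer pubConn.Close()
	receivers, err := redis.Int(pubConn.Do("PUBLISH", channel, `{"rows":[]}`))
	if err != nil {
		log.Fatalf("publish: %v", err)
	}
	// Roughly speaking, results arriving while nobody is subscribed yet is
	// the "campaign waiting for listener (please retry)" situation.
	log.Printf("delivered to %d subscriber(s)", receivers)

	switch m := psc.Receive().(type) {
	case redis.Message:
		log.Printf("received on %s: %s", m.Channel, m.Data)
	case error:
		log.Fatalf("receive: %v", m)
	}
}
```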
t
let me consult with the team to see if anybody else has any other ideas
r
thanks 🙂
There are some constraints on Memorystore compared to “normal” Redis, but I don’t think any of them should affect us: https://cloud.google.com/memorystore/docs/redis/product-constraints
b
Are you using Standard or Basic tier?
r
Basic tier for this particular instance I think.
Yeah it’s basic tier.
I’m pretty sure the issues being logged started when switching to TLS though, we used the same Basic Memorystore without TLS previously and it was fine I think.
Stupidly we changed several things at once though, including upgrading to v4.9.1.
b
There are other limitations for Basic tier, but not sure yet how they might be affecting your setup: https://cloud.google.com/memorystore/docs/redis/product-constraints#basic_tier_limitations
r
Yeah, those are acceptable I think; it’s basically that they’ll upgrade during maintenance windows, but no maintenance has been performed so far.
looks like our scheduled queries continued to work overnight by the way
seems like Live Queries are working still too
actually, no they’re not working 100%
hanging at Online: 1276 hosts / 1047 results
so yeah I suspect it isn’t happy with the TLS feature for some reason
Update for you both @Benjamin Edwards and @Tomas Touceda, running another Fleet instance with Memorystore Redis when TLS is disabled is working absolutely fine, so we’re suspecting there’s something not quite right when TLS is enabled, but it’s not outright failing, just not working correctly. The only thing I can think of that isn’t quite right is that GCP Memorystore signs the certificate using the IP address of the instance, not the DNS name we specify in address? But we do set the tls_server_name config option to the IP to fix that error, so it should be fine? It’s a Redis 6 instance too, is that significant?
I’m tempted to switch off TLS to see if everything works properly again, what do you think?
Thanks for your time on this so far 🙂
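One way to check the IP-vs-DNS-name theory above directly is to look at what certificate the Memorystore endpoint actually serves. The sketch below (placeholder address, not a Fleet tool) prints the certificate's names and then checks whether the instance IP would pass hostname verification, which is essentially what setting tls_server_name to the IP relies on.
```go
package main

import (
	"crypto/tls"
	"log"
)

func main() {
	const addr = "10.0.0.3:6378" // placeholder instance IP and TLS port
	const instanceIP = "10.0.0.3"

	// Verification is skipped on purpose so we can inspect the certificate
	// even if its names would not match; never do this in real Fleet config.
	conn, err := tls.Dial("tcp", addr, &tls.Config{InsecureSkipVerify: true})
	if err != nil {
		log.Fatalf("tls dial: %v", err)
	}
	defer conn.Close()

	leaf := conn.ConnectionState().PeerCertificates[0]
	log.Printf("subject CN: %s", leaf.Subject.CommonName)
	log.Printf("DNS SANs:   %v", leaf.DNSNames)
	log.Printf("IP SANs:    %v", leaf.IPAddresses)

	// Does the instance IP satisfy hostname verification? If this passes,
	// tls_server_name set to the IP should be fine as far as the cert goes.
	if err := leaf.VerifyHostname(instanceIP); err != nil {
		log.Printf("IP does not verify against the cert: %v", err)
	} else {
		log.Printf("IP verifies against the cert")
	}
}
```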
t
if the certificate had an issue, then it would fail all the time. Temporary failures feel more related to something else, e.g. unreliable infra on the GCP side, or increased CPU usage for the TLS layer on the fleet instances
what does the CPU/memory usage look like for the fleet instances now compared to when TLS was on?
r
Basically the same - it’s hardly doing any work.
~6% CPU, 3MB out of 1GB capacity
[attached chart: client activity]
this for the past 6 hours ^
For the time being I’ve turned off TLS on our Redis Memorystore instance. Interestingly, even with the non-TLS setup, I see a lot of this logged when executing a distributed query:
fleet[9704]: level=error ts=2022-02-11T16:29:51.534788035Z component=http method=POST uri=/api/v1/osquery/distributed/write took=196.493333ms ip_addr=<ip>:51318 x_for_ip_addr= ingestion-err="campaign waiting for listener (please retry)" err="timestamp: 2022-02-11T16:29:51Z: error in query ingestion"
However, the distributed query executes just fine and I get results back from all hosts, so I’m not really sure what’s going on, but I’m going to leave it in this configuration for a while and see how it goes.
We have 1274 hosts in this Fleet setup FYI.