#fleet

Ryan

02/08/2022, 6:20 PM
Fleet seems to be logging a lot of lines like this during startup, does anyone know what they mean?
level=error ts=2022-02-08T18:18:27.469599841Z op=QueriesForHost err="load active queries: EOF"
Tomas Touceda

02/08/2022, 6:24 PM
that part reaches out to redis to check for live queries that might be running
6:24 PM
is redis running properly?

Ryan

02/08/2022, 6:24 PM
ah right, so EOF could mean a problem connecting to Redis?
6:24 PM
I have suspicions that it isn’t able to connect properly
6:24 PM
thanks for the pointer 👍
Tomas Touceda

02/08/2022, 6:24 PM
yup, it connects to redis and runs a SMEMBERS; if at some point something just chops off the connection, that's the error you'd see

Ryan

02/08/2022, 6:24 PM
right oh
6:25 PM
i’ll double check the setup
6:58 PM
Still debugging, but I may have to continue tomorrow.
6:58 PM
Incidentally, we’re seeing this logged a lot when executing distributed queries:
method=POST uri=/api/v1/osquery/distributed/write took=1.86989925s ip_addr=<ip>:45002 x_for_ip_addr= ingestion-err="campaign waiting for listener (please retry)" err="timestamp: 2022-02-08T18:57:28Z: error in query ingestion"
6:59 PM
I’ll pick this up again tomorrow.
Tomas Touceda

02/08/2022, 6:59 PM
certainly looks like something around redis misbehaving

Ryan

02/09/2022, 10:31 AM
Hi @Tomas Touceda I’m still trying to debug what’s going on here - the Redis instance itself seems fine, but we’re still seeing all sorts of weird behaviour in Fleet as a result. Is there a quick way to verify the connection from Fleet to Redis is ok?
10:39 AM
I’m going to verify it using `tlsconnect.go`
10:51 AM
2022/02/09 10:50:12 pool created successfully
2022/02/09 10:50:13 command result: NOAUTH Authentication required. ; NOAUTH Authentication required.
Looks good! The AUTH value is set in the Fleet.yml; it doesn’t seem to be possible to test that with the `tlsconnect.go` tool though.
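[Editor's note] In case it helps anyone reproducing this, here is a rough stdlib-only way to exercise AUTH over the TLS connection by speaking the Redis protocol directly. The address, server name, and password are placeholders (Memorystore's in-transit-encryption port is 6378, if memory serves), and you may need to add the instance CA to `RootCAs` for the handshake to verify:

```go
package main

import (
	"bufio"
	"crypto/tls"
	"fmt"
	"log"
	"os"
)

// redisCmd encodes one command in the Redis wire protocol (RESP).
func redisCmd(args ...string) []byte {
	out := fmt.Sprintf("*%d\r\n", len(args))
	for _, a := range args {
		out += fmt.Sprintf("$%d\r\n%s\r\n", len(a), a)
	}
	return []byte(out)
}

// clientConfig pins ServerName explicitly, which matters when the address
// you dial differs from what the certificate was issued for.
func clientConfig(serverName string) *tls.Config {
	return &tls.Config{ServerName: serverName, MinVersion: tls.VersionTLS12}
}

func main() {
	// REDIS_ADDR/REDIS_AUTH are placeholders, e.g. "10.0.0.3:6378"; with
	// no address set we just print the encoded commands instead of dialing.
	addr := os.Getenv("REDIS_ADDR")
	if addr == "" {
		fmt.Printf("%q\n", redisCmd("AUTH", "your-auth-string"))
		fmt.Printf("%q\n", redisCmd("PING"))
		return
	}
	conn, err := tls.Dial("tcp", addr, clientConfig("10.0.0.3"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	conn.Write(redisCmd("AUTH", os.Getenv("REDIS_AUTH")))
	conn.Write(redisCmd("PING"))
	r := bufio.NewReader(conn)
	for i := 0; i < 2; i++ {
		line, err := r.ReadString('\n')
		if err != nil {
			log.Fatal(err) // an abrupt close shows up here as EOF
		}
		fmt.Print(line) // +OK then +PONG on success
	}
}
```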
11:00 AM
ingestion-err="writing results: PUBLISH failed to channel results_49: EOF" err="timestamp: 2022-02-09T10:57:52Z: error in query ingestion"
11:01 AM
the weird thing is - Fleet is logging all of these errors, but Live Queries do work, as do scheduled query packs, so I have no idea what the issue is 🤔
11:01 AM
this is using a GCP Memorystore instance with TLS and AUTH enabled in case anyone can reproduce this.
Tomas Touceda

02/09/2022, 11:04 AM
let me consult with the team to see if anybody else has any other ideas

Ryan

02/09/2022, 11:04 AM
thanks 🙂
11:06 AM
There are some constraints on Memorystore compared to “normal” Redis, but I don’t think any of them should affect us: https://cloud.google.com/memorystore/docs/redis/product-constraints
Benjamin Edwards

02/09/2022, 1:33 PM
Are you using Standard or Basic tier?

Ryan

02/09/2022, 4:29 PM
Basic tier for this particular instance I think.
4:30 PM
Yeah it’s basic tier.
4:30 PM
I’m pretty sure the issues being logged started when switching to TLS though, we used the same Basic Memorystore without TLS previously and it was fine I think.
4:31 PM
Stupidly we changed several things at once though, including upgrading to v4.9.1.
Benjamin Edwards

02/09/2022, 4:37 PM
There are other limitations for basic tier but not sure yet how they might be affecting your setup https://cloud.google.com/memorystore/docs/redis/product-constraints#basic_tier_limitations

Ryan

02/10/2022, 11:42 AM
Yeah, those are acceptable I think; it’s basically maintenance windows during which they will upgrade, but no maintenance has been performed so far.
11:42 AM
looks like our scheduled queries continued to work overnight by the way
11:43 AM
seems like Live Queries are working still too
11:54 AM
actually, no they’re not working 100%
11:54 AM
hanging at Online: 1276 hosts / 1047 results
11:54 AM
so yeah I suspect it isn’t happy with the TLS feature for some reason
6:16 PM
Update for you both @Benjamin Edwards and @Tomas Touceda: running another Fleet instance against Memorystore Redis with TLS disabled works absolutely fine, so we suspect something isn’t quite right when TLS is enabled, but it’s not outright failing, just not working correctly. The only thing I can think of that isn’t quite right is that GCP Memorystore signs the certificate using the IP address of the instance, not the DNS name we specify in `address`. But we do set the `tls_server_name` config option to the IP to fix that error, so it should be fine? It’s a Redis 6 instance too, is that significant?
6:17 PM
I’m tempted to switch off TLS to see if everything works properly again, what do you think?
6:17 PM
Thanks for your time on this so far 🙂
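[Editor's note] For reference, the setup described above would look roughly like this in fleet.yml, with key names as recalled from the Fleet configuration docs (double-check against your Fleet version; the address, password, and IP are placeholders):

```yaml
redis:
  # Memorystore with in-transit encryption listens on 6378, not 6379
  address: 10.0.0.3:6378
  password: your-auth-string
  use_tls: true
  # Memorystore issues the certificate for the instance IP,
  # so pin the expected server name here
  tls_server_name: 10.0.0.3
```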
Tomas Touceda

02/10/2022, 6:20 PM
if the certificate had an issue, then it would fail all the time. Temporary failures feel more related to something else, e.g. unreliable infra on the GCP side, or increased CPU usage from the TLS layer on the Fleet instances
6:20 PM
what does the CPU/memory usage look like for the fleet instances now compared to when TLS was on?

Ryan

02/10/2022, 6:23 PM
Basically the same - it’s hardly doing any work.
6:24 PM
~6% CPU, 3MB out of 1GB capacity
6:24 PM
[screenshot: client activity]
6:24 PM
this for the past 6 hours ^
4:38 PM
For the time being I’ve turned off TLS on our Redis Memorystore instance. Interestingly, even with the non-TLS setup, I see a lot of this logged when executing a distributed query:
fleet[9704]: level=error ts=2022-02-11T16:29:51.534788035Z component=http method=POST uri=/api/v1/osquery/distributed/write took=196.493333ms ip_addr=<ip>:51318 x_for_ip_addr= ingestion-err="campaign waiting for listener (please retry)" err="timestamp: 2022-02-11T16:29:51Z: error in query ingestion"
However, the distributed query executes just fine and I get results back from all hosts, so I’m not really sure what’s going on, but I’m going to leave it in this configuration for a while and see how it goes.
4:38 PM
We have 1274 hosts in this Fleet setup FYI.