# fleet
r
Fleet seems to be logging a lot of lines like this during startup, does anyone know what they mean?
level=error ts=2022-02-08T18:18:27.469599841Z op=QueriesForHost err="load active queries: EOF"
t
that part reaches out to redis to check for live queries that might be running
is redis running properly?
r
ah right, so EOF could mean a problem connecting to Redis?
I have suspicions that it isn’t able to connect properly
thanks for the pointer 👍
t
yup, it connects to redis and runs a SMEMBERS, if at some point something just chops off the connection, that's the error that could happen
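To make that failure mode concrete, here is a minimal sketch of that kind of check using the redigo Redis client for Go. The address and the "active_queries" key name are placeholders for illustration, not Fleet's actual values; the point is where a dropped connection surfaces as the EOF seen in the log above.
```go
package main

import (
	"log"
	"time"

	"github.com/gomodule/redigo/redis"
)

func main() {
	// Placeholder address; in a real setup this comes from the Redis config.
	conn, err := redis.Dial("tcp", "127.0.0.1:6379",
		redis.DialConnectTimeout(5*time.Second),
	)
	if err != nil {
		log.Fatalf("dial redis: %v", err)
	}
	defer conn.Close()

	// Ask Redis for the set of currently active live queries. If the server
	// (or anything in between) drops the connection mid-command, the error
	// comes back as io.EOF, which is the kind of thing that gets wrapped as
	// "load active queries: EOF".
	members, err := redis.Strings(conn.Do("SMEMBERS", "active_queries"))
	if err != nil {
		log.Fatalf("load active queries: %v", err)
	}
	log.Printf("active queries: %v", members)
}
```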
r
right oh
i’ll double check the setup
Still debugging, but I may have to continue tomorrow.
Incidentally, we’re seeing this logged a lot when executing distributed queries:
method=POST uri=/api/v1/osquery/distributed/write took=1.86989925s ip_addr=<ip>:45002 x_for_ip_addr= ingestion-err="campaign waiting for listener (please retry)" err="timestamp: 2022-02-08T18:57:28Z: error in query ingestion"
I’ll pick this up again tomorrow.
t
certainly looks like something around redis misbehaving
r
Hi @Tomas Touceda, I’m still trying to debug what’s going on here - the Redis instance itself seems fine, but we’re still seeing all sorts of weird behaviour in Fleet as a result. Is there a quick way to verify the connection from Fleet to Redis is ok?
I’m going to verify it using tlsconnect.go
2022/02/09 10:50:12 pool created successfully
2022/02/09 10:50:13 command result: NOAUTH Authentication required. ; NOAUTH Authentication required.
Looks good! The AUTH value is set in the Fleet.yml, doesn’t seem to be possible to test that with the tlsconnect.go tool though.
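If it helps, a minimal check that also exercises AUTH (which the tlsconnect.go tool apparently does not) could look roughly like the redigo sketch below. The address, port, server name, password, and CA file path are all placeholders for the Memorystore instance, not values from this thread.
```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"os"
	"time"

	"github.com/gomodule/redigo/redis"
)

func main() {
	const (
		addr       = "10.0.0.3:6378"    // placeholder instance IP and TLS port
		serverName = "10.0.0.3"         // name the served certificate is expected to match
		password   = "your-auth-string" // placeholder AUTH value
	)

	// Trust the server CA downloaded from the Memorystore instance explicitly,
	// since it is not in the system roots.
	caPEM, err := os.ReadFile("server-ca.pem")
	if err != nil {
		log.Fatalf("read CA: %v", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		log.Fatal("could not parse server CA PEM")
	}

	conn, err := redis.Dial("tcp", addr,
		redis.DialConnectTimeout(5*time.Second),
		redis.DialUseTLS(true),
		redis.DialTLSConfig(&tls.Config{ServerName: serverName, RootCAs: pool}),
		// DialPassword makes redigo send AUTH before any other command, so a
		// successful PING below also confirms the AUTH value is accepted.
		redis.DialPassword(password),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	pong, err := redis.String(conn.Do("PING"))
	if err != nil {
		log.Fatalf("ping: %v", err)
	}
	log.Printf("reply: %s", pong) // expect "PONG" if both TLS and AUTH are OK
}
```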
ingestion-err="writing results: PUBLISH failed to channel results_49: EOF" err="timestamp: 2022-02-09T10:57:52Z: error in query ingestion"
the weird thing is - Fleet is logging all of these errors, but Live Queries do work, as do scheduled query packs, so I have no idea what the issue is 🤔
this is using a GCP Memorystore instance with TLS and AUTH enabled in case anyone can reproduce this.
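For anyone following along, both errors above appear to come out of the Redis publish/subscribe path used for live query results: results are PUBLISHed to a per-campaign channel (results_49 in the log) and a listener SUBSCRIBEs to it. Below is a rough sketch of that pattern in redigo, not Fleet's actual code; the address is a placeholder and the channel name is taken from the log.
```go
package main

import (
	"log"
	"time"

	"github.com/gomodule/redigo/redis"
)

func main() {
	const addr = "127.0.0.1:6379" // placeholder
	const channel = "results_49"  // per-campaign channel, as seen in the log

	// Listener side: subscribe to the campaign's results channel.
	subConn, err := redis.Dial("tcp", addr, redis.DialConnectTimeout(5*time.Second))
	if err != nil {
		log.Fatalf("dial (subscribe): %v", err)
	}
	defer subConn.Close()
	psc := redis.PubSubConn{Conn: subConn}
	if err := psc.Subscribe(channel); err != nil {
		log.Fatalf("subscribe: %v", err)
	}
	// Wait for the subscribe confirmation so the publish below is not lost.
	if _, ok := psc.Receive().(redis.Subscription); !ok {
		log.Fatal("expected subscribe confirmation")
	}

	// Ingestion side: publish a (fake) result row for the campaign. If the
	// connection is cut mid-command, the error surfaces here, much like
	// "PUBLISH failed to channel results_49: EOF".
	pubConn, err := redis.Dial("tcp", addr, redis.DialConnectTimeout(5*time.Second))
	if err != nil {
		log.Fatalf("dial (publish): %v", err)
	}
	defer pubConn.Close()
	receivers, err := redis.Int(pubConn.Do("PUBLISH", channel, `{"rows":[]}`))
	if err != nil {
		log.Fatalf("publish: %v", err)
	}
	// Roughly speaking, results arriving while nobody is subscribed yet is
	// the "campaign waiting for listener (please retry)" situation.
	log.Printf("delivered to %d subscriber(s)", receivers)

	switch m := psc.Receive().(type) {
	case redis.Message:
		log.Printf("received on %s: %s", m.Channel, m.Data)
	case error:
		log.Fatalf("receive: %v", m)
	}
}
```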
t
let me consult with the team to see if anybody else has any other ideas
r
thanks 🙂
There are some constraints on Memorystore compared to “normal” Redis, but I don’t think any of them should affect us: https://cloud.google.com/memorystore/docs/redis/product-constraints
b
Are you using Standard or Basic tier?
r
Basic tier for this particular instance I think.
Yeah it’s basic tier.
I’m pretty sure the issues being logged started when switching to TLS though, we used the same Basic Memorystore without TLS previously and it was fine I think.
Stupidly we changed several things at once though, including upgrading to v4.9.1.
b
There are other limitations for Basic tier, but not sure yet how they might be affecting your setup: https://cloud.google.com/memorystore/docs/redis/product-constraints#basic_tier_limitations
r
Yeah, those are acceptable I think; it’s basically that they’ll upgrade during maintenance windows, but no maintenance has been performed so far.
looks like our scheduled queries continued to work overnight by the way
seems like Live Queries are working still too
actually, no they’re not working 100%
hanging at Online: 1276 hosts / 1047 results
so yeah I suspect it isn’t happy with the TLS feature for some reason
Update for you both @Benjamin Edwards and @Tomas Touceda, running another Fleet instance with Memorystore Redis when TLS is disabled is working absolutely fine, so we’re suspecting there’s something not quite right when TLS is enabled, but it’s not outright failing, just not working correctly. The only thing I can think of that isn’t quite right is that GCP Memorystore signs the certificate using the IP address of the instance, not the DNS name we specify in address? But we do set the tls_server_name config option to the IP to fix that error, so it should be fine? It’s a Redis 6 instance too, is that significant?
I’m tempted to switch off TLS to see if everything works properly again, what do you think?
Thanks for your time on this so far 🙂
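One way to check the IP-vs-DNS-name theory above directly is to look at what certificate the Memorystore endpoint actually serves. The sketch below (placeholder address, not a Fleet tool) prints the certificate's names and then checks whether the instance IP would pass hostname verification, which is essentially what setting tls_server_name to the IP relies on.
```go
package main

import (
	"crypto/tls"
	"log"
)

func main() {
	const addr = "10.0.0.3:6378" // placeholder instance IP and TLS port
	const instanceIP = "10.0.0.3"

	// Verification is skipped on purpose so we can inspect the certificate
	// even if its names would not match; never do this in real Fleet config.
	conn, err := tls.Dial("tcp", addr, &tls.Config{InsecureSkipVerify: true})
	if err != nil {
		log.Fatalf("tls dial: %v", err)
	}
	defer conn.Close()

	leaf := conn.ConnectionState().PeerCertificates[0]
	log.Printf("subject CN: %s", leaf.Subject.CommonName)
	log.Printf("DNS SANs:   %v", leaf.DNSNames)
	log.Printf("IP SANs:    %v", leaf.IPAddresses)

	// Does the instance IP satisfy hostname verification? If this passes,
	// tls_server_name set to the IP should be fine as far as the cert goes.
	if err := leaf.VerifyHostname(instanceIP); err != nil {
		log.Printf("IP does not verify against the cert: %v", err)
	} else {
		log.Printf("IP verifies against the cert")
	}
}
```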
t
if the certificate had an issue, then it would fail all the time. Temporary failures feel more related to something else, e.g. unreliable infra on the GCP side, or increased CPU usage for the TLS layer on the fleet instances
what does the CPU/memory usage look like for the fleet instances now compared to when TLS was on?
r
Basically the same - it’s hardly doing any work.
~6% CPU, 3MB out of 1GB capacity
[attached chart: client activity]
this for the past 6 hours ^
For the time being I’ve turned off TLS on our Redis Memorystore instance. Interestingly, even with the non-TLS setup, I see a lot of this logged when executing a distributed query:
fleet[9704]: level=error ts=2022-02-11T16:29:51.534788035Z component=http method=POST uri=/api/v1/osquery/distributed/write took=196.493333ms ip_addr=<ip>:51318 x_for_ip_addr= ingestion-err="campaign waiting for listener (please retry)" err="timestamp: 2022-02-11T16:29:51Z: error in query ingestion"
However, the distributed query executes just fine and I get results back from all hosts, so I’m not really sure what’s going on, but I’m going to leave it in this configuration for a while and see how it goes.
We have 1274 hosts in this Fleet setup FYI.