# fleet
j
still having issues with redis, even after upgrading to 4.2.4 today. We're now seeing the error
```
err="scan keys: dial tcp 10.10.24.224:6380: i/o timeout" msg="failed to migrate live query redis keys"
```
t
we changed how some things behave around redis clusters in 4.2.4 and improved them further in 4.3.0. It might be worth a try
that said, can you connect directly to that node with redis-cli?
j
yup, redis-cli works fine
but it shows no keys
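For reference, a minimal sketch of roughly what that startup check does, assuming the redigo client (github.com/gomodule/redigo) that Fleet uses; the address and the 5-second timeouts mirror the error above, and the SCAN is only a stand-in for the live query key migration:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/gomodule/redigo/redis"
)

func main() {
	// Dial the same node fleet serve is pointed at, with an explicit
	// 5-second connect timeout to match the error in the logs.
	conn, err := redis.Dial("tcp", "10.10.24.224:6380",
		redis.DialConnectTimeout(5*time.Second),
		redis.DialReadTimeout(5*time.Second),
		redis.DialWriteTimeout(5*time.Second),
	)
	if err != nil {
		log.Fatalf("dial: %v", err) // the "i/o timeout" surfaces here
	}
	defer conn.Close()

	// PING confirms the node answers; an empty keyspace is fine.
	if _, err := conn.Do("PING"); err != nil {
		log.Fatalf("ping: %v", err)
	}

	// A single SCAN page, standing in for the key migration step.
	values, err := redis.Values(conn.Do("SCAN", 0, "COUNT", 100))
	if err != nil {
		log.Fatalf("scan: %v", err)
	}
	cursor, _ := redis.Int(values[0], nil)
	keys, _ := redis.Strings(values[1], nil)
	fmt.Printf("scanned %d keys (cursor %d)\n", len(keys), cursor)
}
```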
t
does it take longer than 5s to connect?
it's ok if there are no keys
j
nope, it was definitely less than 5 seconds
t
and this is from the same host that is running fleet serve, correct?
j
yes
if I enable debug logging, will it have more info about that redis timeout?
```
Sep 08 19:44:34 osquery-service-vab147.ec2.vzbuilders.com systemd[1]: Started Kolide Fleet.
Sep 08 19:44:34 osquery-service-vab147.ec2.vzbuilders.com fleet[15741]: Using config file:  /etc/kolide/fleet.yml
Sep 08 19:44:51 osquery-service-vab147.ec2.vzbuilders.com fleet[15741]: level=info ts=2021-09-08T19:44:51.921971702Z err="scan keys: dial tcp 10.10.24.226:6380: i/o timeout" msg="failed to migrate live query redis keys"
```
there is more than 5 seconds between fleet reading the config yaml and getting the redis error on startup
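One way to narrow that down is to time the raw TCP connect to the Redis port against the same 5-second budget; a standalone sketch (not Fleet code) using only the Go standard library, with the node address taken from the log and a loop because the failure looks intermittent:

```go
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	const addr = "10.10.24.226:6380"
	for i := 0; i < 10; i++ {
		start := time.Now()
		conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
		elapsed := time.Since(start)
		if err != nil {
			// Mirrors the startup failure: the connect did not finish in time.
			log.Printf("attempt %d: dial failed after %s: %v", i+1, elapsed, err)
			continue
		}
		conn.Close()
		log.Printf("attempt %d: connected in %s", i+1, elapsed)
		time.Sleep(time.Second)
	}
}
```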
t
yeah, something is preventing it from connecting fast enough, or at all
j
okay, interesting, when I restarted in debug it didn't get the error, so it seems intermittent
is that 5s timeout going to be configurable in 4.3?
t
are the metrics for the redis cluster ok?
I'm looking into the timeout to see what we can provide
j
yeah, the cluster is basically doing nothing right now
t
is it in the same region as the fleet instance?
j
it is a global cluster, but yes, the instance I'm testing on is in the same region as the primary redis
t
gotcha, thank you for answering my million questions
j
thanks for helping me troubleshoot
t
we'll make the timeout configurable
j
👍
and if you're taking requests, a configurable retry for that connection would be lovely 🙂
t
always welcoming requests!
will look into retries; the issue is at the connection level, and key migrations are just one place that uses that connection
retries will probably not make it into 4.3.0, so could you create a feature request issue? https://github.com/fleetdm/fleet/issues/new?assignees=&labels=idea&template=feature-request.md&title= As for the timeout: https://github.com/fleetdm/fleet/pull/1968
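For illustration only, a connection-level retry with backoff could look something like the sketch below; this is not Fleet's implementation, the attempt count and delays are arbitrary placeholders, and it assumes the redigo client:

```go
package main

import (
	"fmt"
	"time"

	"github.com/gomodule/redigo/redis"
)

// dialWithRetry keeps trying to establish the Redis connection, backing
// off between attempts, and returns the last error if every attempt fails.
func dialWithRetry(addr string, attempts int, connectTimeout time.Duration) (redis.Conn, error) {
	var lastErr error
	backoff := time.Second
	for i := 0; i < attempts; i++ {
		conn, err := redis.Dial("tcp", addr, redis.DialConnectTimeout(connectTimeout))
		if err == nil {
			return conn, nil
		}
		lastErr = err
		time.Sleep(backoff)
		backoff *= 2 // simple exponential backoff between attempts
	}
	return nil, fmt.Errorf("connect to %s failed after %d attempts: %w", addr, attempts, lastErr)
}

func main() {
	conn, err := dialWithRetry("10.10.24.224:6380", 3, 5*time.Second)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer conn.Close()
	fmt.Println("connected")
}
```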
j
done!
m
fyi here's the link to @Jocelyn Bothe's GitHub issue: https://github.com/fleetdm/fleet/issues/1969 (Thanks Jocelyn!)