#fleet
Jocelyn Bothe

09/08/2021, 6:37 PM
still having issues with redis, even after upgrading to 4.2.4 today. We're now seeing the error
err="scan keys: dial tcp 10.10.24.224:6380: i/o timeout" msg="failed to migrate live query redis keys"
Tomas Touceda

09/08/2021, 7:42 PM
we changed how some things behave in 4.2.4 and improved them further in 4.3.0 around redis clusters. It might be worth a try
7:42 PM
that said, can you connect directly to that node with redis-cli?
Jocelyn Bothe

09/08/2021, 7:42 PM
yup, redis-cli works fine
7:42 PM
but it shows no keys
Tomas Touceda

09/08/2021, 7:43 PM
does it take longer than 5s to connect?
7:43 PM
it's ok if there are no keys
Jocelyn Bothe

09/08/2021, 7:44 PM
nope, it was definitely less than 5 seconds
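
One way to put a number on the "longer than 5s to connect?" question is to time a bare TCP connect from the host that runs fleet serve. This is only a sketch: the helper name `time_tcp_connect` is made up for illustration, it measures the TCP handshake only (no Redis protocol exchange), and it is not how Fleet itself connects.

```python
import socket
import time


def time_tcp_connect(host, port, timeout=5.0):
    """Return the seconds taken to open a TCP connection, or None on failure.

    Hypothetical helper for troubleshooting; it only times the TCP handshake.
    """
    start = time.monotonic()
    try:
        # create_connection raises OSError (timeouts included) when it fails.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None
```

Calling `time_tcp_connect("10.10.24.224", 6380)` several times in a row from the Fleet host would show whether the connect time is occasionally close to, or over, the 5-second limit.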
Tomas Touceda

09/08/2021, 7:45 PM
and this is from the same host that is running fleet serve, correct?
Jocelyn Bothe

09/08/2021, 7:45 PM
yes
7:47 PM
if I enable debug logging, will it have more info about that redis timeout?
7:47 PM
Sep 08 19:44:34 osquery-service-vab147.ec2.vzbuilders.com systemd[1]: Started Kolide Fleet.
Sep 08 19:44:34 osquery-service-vab147.ec2.vzbuilders.com fleet[15741]: Using config file:  /etc/kolide/fleet.yml
Sep 08 19:44:51 osquery-service-vab147.ec2.vzbuilders.com fleet[15741]: level=info ts=2021-09-08T19:44:51.921971702Z err="scan keys: dial tcp 10.10.24.226:6380: i/o timeout" msg="failed to migrate live query redis keys"
7:47 PM
there is more than 5 seconds between fleet reading the config yaml and getting the redis error on startup
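
For reference, the gap between those two log lines can be worked out from their syslog timestamps. A small sketch, using only the timestamps shown in the log excerpt; the variable names are illustrative:

```python
from datetime import datetime

# Syslog timestamps copied from the log lines above (this format has no year).
STAMP_FORMAT = "%b %d %H:%M:%S"
config_read = datetime.strptime("Sep 08 19:44:34", STAMP_FORMAT)
redis_error = datetime.strptime("Sep 08 19:44:51", STAMP_FORMAT)

# Elapsed time between reading the config file and logging the redis error.
gap_seconds = (redis_error - config_read).total_seconds()
print(gap_seconds)  # 17.0
```

That is 17 seconds between startup and the error, which is well past a single 5-second connect timeout. The log alone does not show how that time was spent.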
Tomas Touceda

09/08/2021, 7:48 PM
yeah, something is preventing it from connecting fast enough, or at all
Jocelyn Bothe

09/08/2021, 7:50 PM
okay, interesting, when I restarted in debug it didn't get the error, so it seems intermittent
7:51 PM
is that 5s timeout going to be configurable in 4.3?
Tomas Touceda

09/08/2021, 7:51 PM
are the metrics for the redis cluster ok?
7:51 PM
I'm looking into the timeout to see what we can provide
Jocelyn Bothe

09/08/2021, 7:51 PM
yeah, the cluster is basically doing nothing right now
Tomas Touceda

09/08/2021, 7:52 PM
is it in the same region as the fleet instance?
Jocelyn Bothe

09/08/2021, 7:53 PM
it is a global cluster, but yes, the instance I'm testing on is in the same region
7:53 PM
as the primary redis
Tomas Touceda

09/08/2021, 7:53 PM
gotcha, thank you for answering my million questions
Jocelyn Bothe

09/08/2021, 7:53 PM
thanks for helping me troubleshoot
Tomas Touceda

09/08/2021, 7:53 PM
we'll make the timeout configurable
Jocelyn Bothe

09/08/2021, 7:53 PM
👍
7:58 PM
and if you're taking requests, a configurable retry for that connection would be lovely 🙂
Tomas Touceda

09/08/2021, 8:00 PM
always welcoming requests!
8:02 PM
will look into retries. the issue is at the connection level; key migration is just one of the places that uses that connection
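
To illustrate what a configurable timeout plus retry at the connection level could look like, here is a minimal sketch. It is not Fleet's implementation: the function `connect_with_retry` and its `timeout`, `attempts`, and `backoff` parameters are hypothetical names chosen for the example, and it retries a plain TCP dial rather than a Redis client connection.

```python
import socket
import time


def connect_with_retry(host, port, timeout=5.0, attempts=3, backoff=0.5,
                       dial=socket.create_connection, sleep=time.sleep):
    """Open a TCP connection, retrying with exponential backoff on failure.

    timeout, attempts and backoff are the hypothetical tunables. dial and
    sleep are injectable so the retry logic can be exercised without a network.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return dial((host, port), timeout=timeout)
        except OSError as exc:  # covers timeouts and refused connections
            last_error = exc
            if attempt < attempts:
                # Wait backoff, then 2*backoff, then 4*backoff, ...
                sleep(backoff * 2 ** (attempt - 1))
    raise ConnectionError(
        f"could not connect to {host}:{port} after {attempts} attempts"
    ) from last_error
```

Because the retry wraps the dial itself, every caller that opens a connection would benefit, not just the key migration that happened to surface the error here.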
8:13 PM
retries will probably not make it into 4.3.0, could you create a feature request issue? https://github.com/fleetdm/fleet/issues/new?assignees=&labels=idea&template=feature-request.md&title= as for the timeout, see https://github.com/fleetdm/fleet/pull/1968
Jocelyn Bothe

09/08/2021, 8:19 PM
done!
mikermcneil

09/09/2021, 8:08 PM
fyi here's the link to @Jocelyn Bothe's GitHub issue: https://github.com/fleetdm/fleet/issues/1969 (Thanks Jocelyn!)