# fleet
j
still having issues with redis, even after upgrading to 4.2.4 today. We're now seeing the error
```
err="scan keys: dial tcp 10.10.24.224:6380: i/o timeout" msg="failed to migrate live query redis keys"
```
t
we changed how some things behave around redis clusters in 4.2.4 and improved them further in 4.3.0. It might be worth a try
that said, can you connect directly to that node with redis-cli?
j
yup, redis-cli works fine
but it shows no keys
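For reference, a minimal sketch of roughly what that startup check does, assuming the redigo client (github.com/gomodule/redigo) that Fleet uses; the address and the 5-second timeouts mirror the error above, and the SCAN is only a stand-in for the live query key migration:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/gomodule/redigo/redis"
)

func main() {
	// Dial the same node fleet serve is pointed at, with an explicit
	// 5-second connect timeout to match the error in the logs.
	conn, err := redis.Dial("tcp", "10.10.24.224:6380",
		redis.DialConnectTimeout(5*time.Second),
		redis.DialReadTimeout(5*time.Second),
		redis.DialWriteTimeout(5*time.Second),
	)
	if err != nil {
		log.Fatalf("dial: %v", err) // the "i/o timeout" surfaces here
	}
	defer conn.Close()

	// PING confirms the node answers; an empty keyspace is fine.
	if _, err := conn.Do("PING"); err != nil {
		log.Fatalf("ping: %v", err)
	}

	// A single SCAN page, standing in for the key migration step.
	values, err := redis.Values(conn.Do("SCAN", 0, "COUNT", 100))
	if err != nil {
		log.Fatalf("scan: %v", err)
	}
	cursor, _ := redis.Int(values[0], nil)
	keys, _ := redis.Strings(values[1], nil)
	fmt.Printf("scanned %d keys (cursor %d)\n", len(keys), cursor)
}
```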
t
does it take longer than 5s to connect?
it's ok if there are no keys
j
nope, it was definitely less than 5 seconds
t
and this is from the same host that is running fleet serve, correct?
j
yes
if I enable debug logging, will it have more info about that redis timeout?
```
Sep 08 19:44:34 osquery-service-vab147.ec2.vzbuilders.com systemd[1]: Started Kolide Fleet.
Sep 08 19:44:34 osquery-service-vab147.ec2.vzbuilders.com fleet[15741]: Using config file:  /etc/kolide/fleet.yml
Sep 08 19:44:51 osquery-service-vab147.ec2.vzbuilders.com fleet[15741]: level=info ts=2021-09-08T19:44:51.921971702Z err="scan keys: dial tcp 10.10.24.226:6380: i/o timeout" msg="failed to migrate live query redis keys"
```
there is more than 5 seconds between fleet reading the config yaml and getting the redis error on startup
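One way to narrow that down is to time the raw TCP connect to the Redis port against the same 5-second budget; a standalone sketch (not Fleet code) using only the Go standard library, with the node address taken from the log and a loop because the failure looks intermittent:

```go
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	const addr = "10.10.24.226:6380"
	for i := 0; i < 10; i++ {
		start := time.Now()
		conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
		elapsed := time.Since(start)
		if err != nil {
			// Mirrors the startup failure: the connect did not finish in time.
			log.Printf("attempt %d: dial failed after %s: %v", i+1, elapsed, err)
			continue
		}
		conn.Close()
		log.Printf("attempt %d: connected in %s", i+1, elapsed)
		time.Sleep(time.Second)
	}
}
```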
t
yeah, something is preventing it from connecting fast enough, or at all
j
okay, interesting, when I restarted in debug it didn't get the error, so it seems intermittent
is that 5s timeout going to be configurable in 4.3?
t
are the metrics for the redis cluster ok?
I'm looking into the timeout to see what we can provide
j
yeah, the cluster is basically doing nothing right now
t
is it in the same region as the fleet instance?
j
it is a global cluster, but yes, the instance I'm testing on is in the same region as the primary redis
t
gotcha, thank you for answering my million questions
j
thanks for helping me troubleshoot
t
we'll make the timeout configurable
j
👍
and if you're taking requests, a configurable retry for that connection would be lovely 🙂
t
always welcoming requests!
will look into retries; the issue is at the connection level, and key migrations are just one place that uses that connection
retries will probably not make it into 4.3.0, so could you create a feature request issue? https://github.com/fleetdm/fleet/issues/new?assignees=&labels=idea&template=feature-request.md&title= As for the timeout: https://github.com/fleetdm/fleet/pull/1968
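For illustration only, a connection-level retry with backoff could look something like the sketch below; this is not Fleet's implementation, the attempt count and delays are arbitrary placeholders, and it assumes the redigo client:

```go
package main

import (
	"fmt"
	"time"

	"github.com/gomodule/redigo/redis"
)

// dialWithRetry keeps trying to establish the Redis connection, backing
// off between attempts, and returns the last error if every attempt fails.
func dialWithRetry(addr string, attempts int, connectTimeout time.Duration) (redis.Conn, error) {
	var lastErr error
	backoff := time.Second
	for i := 0; i < attempts; i++ {
		conn, err := redis.Dial("tcp", addr, redis.DialConnectTimeout(connectTimeout))
		if err == nil {
			return conn, nil
		}
		lastErr = err
		time.Sleep(backoff)
		backoff *= 2 // simple exponential backoff between attempts
	}
	return nil, fmt.Errorf("connect to %s failed after %d attempts: %w", addr, attempts, lastErr)
}

func main() {
	conn, err := dialWithRetry("10.10.24.224:6380", 3, 5*time.Second)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer conn.Close()
	fmt.Println("connected")
}
```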
j
done!
m
fyi here's the link to @Jocelyn Bothe's GitHub issue: https://github.com/fleetdm/fleet/issues/1969 (Thanks Jocelyn!)