# fleet
j
hey, why is fleet making so many redis calls now? we're seeing connections max out on our redis cluster (10 nodes each with 65000 connections)
Jul 29 19:38:53 osquery-service-vaa32.ec2.vzbuilders.com fleet[8471]: {"component":"service","err":"retrieve live queries: scan active queries: scan keys: redisc: failed to get a connection","ip_addr":"127.0.0.1:42004","level":"info","method":"GetDistributedQueries","took":"7.691258278s","ts":"2021-07-29T19:38:53.186431717Z","x_for_ip_addr":"216.155.204.8"}
Jul 29 19:38:53 osquery-service-vaa32.ec2.vzbuilders.com fleet[8471]: {"component":"service","err":"retrieve live queries: scan active queries: scan keys: redisc: failed to get a connection","ip_addr":"127.0.0.1:42000","level":"info","method":"GetDistributedQueries","took":"7.704909626s","ts":"2021-07-29T19:38:53.186638487Z","x_for_ip_addr":"98.139.22.220"}
Jul 29 19:38:53 osquery-service-vaa32.ec2.vzbuilders.com fleet[8471]: {"component":"service","err":"retrieve live queries: scan active queries: scan keys: redisc: failed to get a connection","ip_addr":"127.0.0.1:39646","level":"info","method":"GetDistributedQueries","took":"9.993457634s","ts":"2021-07-29T19:38:53.186933079Z","x_for_ip_addr":"98.139.22.220"}
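(To gauge how widespread this is across hosts, the JSON payload in each syslog line can be tallied. A minimal sketch, assuming every line has the shape shown above, with the JSON object starting at the first `{`; `tally_errors` is a hypothetical helper, not part of Fleet.)

```python
import json
from collections import Counter
from typing import Iterable


def tally_errors(lines: Iterable[str]) -> Counter:
    """Count (method, err) pairs from syslog lines that carry a JSON payload."""
    counts: Counter = Counter()
    for line in lines:
        start = line.find("{")  # assumption: the payload begins at the first brace
        if start == -1:
            continue  # no JSON payload on this line
        try:
            event = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if "err" in event:
            counts[(event.get("method"), event["err"])] += 1
    return counts
```

Feeding it the journal output from each Fleet host (for example `tally_errors(open(path))`, where `path` is whatever log file you collect) gives a per-error count that can be compared before and after a config change.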
z
What version of Fleet are you using? Which did you upgrade from? We have not seen Fleet using up Redis connections in any way similar to that in the past.
j
we upgraded from 3.10 to 4.0.1 last week, and then to 4.1.0 yesterday
it looks like every node in the fleet is periodically doing this
"error": "retrieve live queries: scan active queries: scan keys: redisc: failed to get a connection"
r
Hey Jocelyn, can you give us more info on how you configured Redis?
j
I got our connections sorted by turning off tcp keepalive and setting the connection idle timeout to 20 seconds, but we're still getting this error when attempting live queries
Live query request failed
Error: Unknown error: TypeError: Cannot read property '0' of undefined
our fleet redis config is
redis:
  address: 127.0.0.1:6379
  password: ${redis_auth}
we're using stunnel to connect to our global elasticache redis cluster
fips = no
setuid = root
setgid = root
pid = /var/run/stunnel.pid
debug = 7 
delay = yes
options = NO_SSLv2
options = NO_SSLv3
[redis-cli]
   client = yes
   accept = 127.0.0.1:6379
   connect = ${redis_m}
[redis-cli-replica]
   client = yes
   accept = 127.0.0.1:6380
   connect = ${redis_r}
z
Can you look at the network inspector in your browser devtools and see if there are any more details on the error in the response from the Fleet server?
j
we have redis-cli installed too, if there's a query I could run manually to generate additional data
z
Is there possibly some stunnel configuration that is causing this?
j
I can't rule it out, but it was working successfully with the same config before we upgraded to the latest fleet
we are seeing fleet connect to redis
we disabled encryption so we could get rid of stunnel and connect directly to redis from fleet
redis:
#  address: 127.0.0.1:6379
  address: [cluster-name].nl4nlg.clustercfg.use1.cache.amazonaws.com:6380
still getting errors on live queries
[root@osquery-service-orb164 log]# redis-cli -h [cluster-name].nl4nlg.clustercfg.use1.cache.amazonaws.com -p 6380
[cluster-name].nl4nlg.clustercfg.use1.cache.amazonaws.com:6380> scan 0
1) "0"
2) (empty array)
and the redis cache appears to be empty
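(One caveat worth checking: on a Redis Cluster, SCAN only iterates the keys held by the node that receives the command, so an empty result from one node does not show the other nodes are empty. A minimal sketch of a full check, assuming each client exposes a redis-py style `scan(cursor=..., count=...)` that returns `(next_cursor, keys)`; `scan_node` and `scan_cluster` are hypothetical helper names.)

```python
from typing import Iterable, List, Protocol, Tuple


class ScanClient(Protocol):
    """Anything with a redis-py style scan(), e.g. one redis.Redis per node."""

    def scan(self, cursor: int = 0, count: int = 100) -> Tuple[int, list]: ...


def scan_node(client: ScanClient, count: int = 100) -> List:
    """Follow the SCAN cursor on one node until it returns to 0."""
    keys: List = []
    cursor = 0
    while True:
        cursor, batch = client.scan(cursor=cursor, count=count)
        keys.extend(batch)
        if cursor == 0:  # a cursor of 0 marks the end of the iteration
            return keys


def scan_cluster(clients: Iterable[ScanClient]) -> List:
    """Collect keys from every node, since SCAN covers only one node per call."""
    found: List = []
    for client in clients:
        found.extend(scan_node(client))
    return found
```

With redis-py this would be called with one `redis.Redis(host=..., port=...)` client per node; the node addresses are not shown in this thread and would have to come from the cluster itself.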