# general
d
Hi everyone. We've noticed that our osquery clients are opening and tearing down connections each time they check in on the distributed interval. We have `tls_session_reuse: true`. Is that only for the logger perhaps? It's surprising to me that we have constant sessions up and down - this is currently 5000 per min and will go even more crazy as we scale.
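For reference, the two osquery flags in play for connection reuse (values shown are the documented defaults as best I recall; double-check against your osquery version):

```
--tls_session_reuse=true      # reuse the TLS session/connection between requests
--tls_session_timeout=3600    # seconds before a reused session is torn down
```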
j
we actually set it to false some time ago, as it seemed to give us the best results... I didn't see any real reuse of the sessions
👍 1
d
Thanks Juan.
I had planned to disable it and check to see what happens
z
@npamnani @Seshu Looks like this was an Uptycs-added feature. Any insight?
s
As far as I can see, it only stores a single HTTP client per thread, which is connected to only one endpoint; if different endpoints have to be contacted, the connection has to be closed and reopened.
s
It is supposed to help avoid connection churn. It also depends on how the receiving side is configured (protocol, keep-alive, duration, etc.). If you are doing a periodic distributed read, it doesn't help for that. It may be useful for the logger, assuming there is constant data (events etc.) being sent.
d
i see
@Stefano Bonicatti - we do have an NLB in front of Fleet that is deployed across 3 AWS AZs, so the connections will go across 3 IPs
we do have keepalives enabled: `keepalive_timeout 65; keepalive_requests 100;`
i assume osquery uses the system resolver. if it gets 3 IPs back from an nslookup, it doesn't try to connect to all 3 I hope?
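For context, a minimal sketch of where those keepalive directives sit in an nginx front end for Fleet; the upstream name, backend port, and the omitted TLS directives are placeholders, not the actual config:

```
upstream fleet {
    server 127.0.0.1:8080;       # placeholder Fleet backend
    keepalive 32;                # keep idle upstream connections to Fleet open
}

server {
    listen 443 ssl;              # cert/key directives omitted
    keepalive_timeout 65;        # osquery -> nginx keep-alive, as quoted above
    keepalive_requests 100;

    location / {
        proxy_http_version 1.1;
        proxy_set_header Connection "";   # needed for upstream keepalive
        proxy_pass http://fleet;
    }
}
```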
s
NLB forwarding to Nginx? We have identical keepalive configuration with Nginx. What is your distributed_interval? We use 0 and we see 2 connections from Osquery to Nginx. Every time config kicks in, it creates another temporary connection that gets closed.
d
@Seshu our distributed interval was 10 sec, we changed to 60 as a test to see if connections dropped (which they did). Yes, NLB --> nginx --> fleet
what is setting the dist interval to 0 supposed to accomplish? never check in?
s
"Seconds between polling for new queries (default 60)". So this is the sleep duration between sending distributed_read requests.
I might be wrong, but the timeout handler could be closing the connection after 16 seconds (hard-coded in tls.cpp). Which means keep-alive for distributed connections might not make any difference. Need to double check!!!
d
oh, does 0 mean it uses the default which is 60 seconds, or did you mean to type that your dist interval was set to 60? you have 0 above. that would be bad if it was hardcoded to 16 sec
s
It is kind'a confusing (I think), but it is the sleep duration between sending distributed requests. So we don't sleep between requests. By default, it sleeps for 60 seconds if you don't change the flag value.
```
./dispatcher/distributed_runner.cpp:      pause(std::chrono::seconds(FLAGS_distributed_interval));
```
You might want to run for a bit with `--tls_dump` and see what is going on...
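For example, something along these lines (the flag file path is a placeholder):

```
# run osqueryd with TLS request/response dumping and verbose logging
osqueryd --flagfile=/etc/osquery/osquery.flags --tls_dump --verbose
```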
d
oh, got it. thanks
ya, I've run with tls_dump before... things look pretty standard. I can see the check-ins on the interval, etc.
s
I usually look for the `Connection` header in request/response and also track `netstat` connection churn from Osquery.
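A rough way to watch that churn from the client side (assumes Linux and a TLS endpoint on 443; adjust to your setup):

```
# count client connections to the TLS endpoint, grouped by TCP state;
# a steady stream of TIME_WAIT sockets is a good indicator of churn
netstat -tn | grep ':443' | awk '{print $6}' | sort | uniq -c
```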
d
right
let me know what you find out about the timeout handler....if you are looking at that. if it's hardcoded for the distributed endpoints, nothing we do in nginx is going to make any difference
also, with distributed interval the docs say `--distributed_interval=60`: "In seconds, the amount of time that osqueryd will wait between periodically checking in with a distributed query server to see if there are any queries to execute." This is how I understood it. Your example above @Seshu says you set this to 0. That seems crazy - how can your clients check in without any interval between subsequent check-ins?
s
🙂 In our config, we don't pause. We always have Osquery waiting for queries. We can send a query to thousands of hosts and get back an answer in a second. If we pause, we will have to wait that amount of time to get responses back.
d
sure...that's why we set it to 10
🙂
ok, makes sense I guess
perhaps that's better from a connection standpoint then
we should try that
z
@Seshu how many RPS does a single osquery instance generate with the interval set to 0? Do you use a strategy in which the server waits to respond to the request until there is a query ready?
d
interesting to think about. ya, you'd get a lot of requests to the read api with that set to 0... as fast as it can do them? how would that even work?
s
Every distributed query waits for 12-odd seconds before the server sends back an empty response. If one or more queries come through, we immediately send them to Osquery on the waiting socket. Osquery immediately sends another request without delay once it gets a response from the back-end. With keep-alive, it will be on the same connection. Like I mentioned above, it usually is 2 connections per host (1 for DR, 1 for Log). And an occasional config connection.
👍 1
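For illustration, a minimal sketch of that hold-the-request idea. This is purely illustrative C++ (not Uptycs or Fleet code); the ~12 second deadline is taken from the description above:

```
// Hold a distributed_read until a query is queued or the deadline passes.
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

std::mutex m;
std::condition_variable cv;
std::queue<std::string> pending;  // queries queued for this host

// Server-side handler: block until a query arrives or ~12 s elapse.
std::string handleDistributedRead() {
  std::unique_lock<std::mutex> lock(m);
  bool got = cv.wait_for(lock, std::chrono::seconds(12),
                         [] { return !pending.empty(); });
  if (!got) {
    return "{}";  // empty response; osquery immediately re-polls
  }
  std::string query = pending.front();
  pending.pop();
  return query;
}

int main() {
  // Simulate an operator queueing a query 2 seconds into the poll.
  std::thread operator_thread([] {
    std::this_thread::sleep_for(std::chrono::seconds(2));
    std::lock_guard<std::mutex> lock(m);
    pending.push("SELECT * FROM osquery_info;");
    cv.notify_one();
  });

  std::cout << "response: " << handleDistributedRead() << std::endl;
  operator_thread.join();
}
```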
z
Ah, interesting strategy! Thanks for sharing 🙂
d
ya, that's slick. I like it
z
How does your agent behave when offline? Does it use a lot of CPU spamming distributed query connection attempts? Or is this something you changed in your agent?
s
It does exponential back-off...
z
Is that osquery functionality or something you changed?
s
I need to check what is in OSS and if it is same as what we have .. 😞
z
Ha no worries, I can check osquery. Thanks for the insights!
👍 1
s
(shared a link to the relevant retry/back-off code in osquery)
z
Ah nice, thanks Stefano. Makes sense we have that for all requests.
d
how many failed attempts until it starts backing off? Considering that our clients (currently) can only communicate to Fleet when on an end-user initiated VPN connection, we definitely don't have persistent connectivity. I don't think we have any problems when `distributed_interval` is set to 10 seconds, so I'm guessing it won't be an issue with 0. I'll have to test this out.
s
Hum so, the backoff happens immediately at the first failure, but the caveat is that the retry attempts for that specific request are configured by a flag that depends on where the request originated from. In the case of the distributed plugin, the default is 3 attempts total (`distributed_tls_max_attempts`), which means 1 immediate, then if the previous failed, 1 after 1 second, 1 after 4 seconds. After this, the `distributed_interval` kicks in.
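In flag terms (values illustrative; the roughly +1 s and +4 s retry spacing is per the description above):

```
--distributed_tls_max_attempts=3   # total attempts per distributed request (the default)
--distributed_interval=10          # poll interval that kicks in after those attempts
```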
d
ah, thanks @Stefano Bonicatti. I should have realized as I'm familiar with the max attempts setting
z
@Dan Achin with Fleet you'll get a response to a distributed request pretty much immediately so you're going to have osquery making a lot of requests if you set the interval to 0.
d
Hmm, it sounds to me like that would actually be less though based on the info above: https://osquery.slack.com/archives/C08V7KTJB/p1619218588257100?thread_ts=1619213725.247500&cid=C08V7KTJB What am I missing?
well, one every 12 seconds anyways. but a lot less connections
z
Fleet doesn't wait to return an empty response.
d
oh...was he talking about something other than Fleet? i guess i assumed as I thought I was in the Fleet channel. 🙂
so then ya...might not be the best idea for us.
z
He was talking about the Uptycs commercial project which uses a fork of osquery.
d
K, thanks. Looks like we'll need to figure out why we see so many connections from clients to Fleet. The last count I had was around 300 per system and every check-in to the read endpoint is generating a new one.
z
I wonder if this could be a bug in the `tls_session_timeout` implementation? Would be good to file an issue with osquery if you can come up with a repro.
🙏 1
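If it comes to filing an issue, a repro could be as simple as running with the reuse flags set and counting connections over time (flag values, file path, and port are illustrative):

```
# baseline: session reuse on, generous session timeout
osqueryd --flagfile=/etc/osquery/osquery.flags \
  --tls_session_reuse=true --tls_session_timeout=3600 \
  --distributed_interval=10 --verbose

# in another shell, watch whether new connections keep getting created anyway
watch -n 10 "netstat -tn | grep -c ':443'"
```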
d
ah...i should have been looking at that setting. I could try setting it to 0 and allowing nginx to manage the sessions
let me see what I can figure out
@Seshu - were you ever able to confirm this?
```
I might be wrong, but the timeout handler could be closing the connection after 16 seconds (hard-coded in tls.cpp). Which means keep-alive for distributed connections might not make any difference. Need to double check!!!
```
I've been tweaking `tls_session_timeout` on our clients and I'm not seeing any difference in the number of connections opened/closed from that. The only thing I'm seeing that reduces connection churn from clients is increasing our `distributed_interval`. I'm thinking I'll open an osquery issue, but wondered if you confirmed a hard-coded timeout in tls.cpp.
s
Sorry. Haven't looked into it @Dan Achin. Don't have the OSS Osquery setup handy. Will try to get to it...
d
ok, thanks. let me know if you get to it, would appreciate it. I'll open a git issue today at some point.
s
Thanks. Will look into this tonight...
🙏 1
@Stefano Bonicatti the linked condition at http_client.cpp:339 is causing the socket to close all the time. At least I see that all log requests are hitting that code path. Both `new_client_options_` and `client_options_.ssl_connection_ != ssl_connection` are `true` for the TLS logger.
🙏 1
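To make that concrete, here is a toy illustration of the pattern being described: if the 'options changed' check stays true on every request, each log send tears down and reopens the connection. This is a simplification, not the actual osquery http_client.cpp source:

```
#include <iostream>

struct Options {
  bool ssl_connection = true;
};

struct Client {
  Options client_options_;
  bool new_client_options_ = true;  // never cleared in this toy => every request reconnects
  bool connected_ = false;

  void request(bool ssl_connection) {
    // Paraphrase of the condition discussed above: options look "new", or the
    // requested TLS mode differs from the one the current socket was opened with.
    if (new_client_options_ ||
        client_options_.ssl_connection != ssl_connection) {
      if (connected_) {
        std::cout << "closing existing socket\n";
      }
      connected_ = false;
      client_options_.ssl_connection = ssl_connection;
    }
    if (!connected_) {
      std::cout << "opening new connection\n";
      connected_ = true;
    }
    std::cout << "sending request\n";
  }
};

int main() {
  Client c;
  // Identical options on every call, yet each request reconnects because the
  // "new options" flag stays set - the kind of churn described for the TLS logger.
  for (int i = 0; i < 3; ++i) {
    c.request(true);
  }
}
```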
d
@Stefano Bonicatti - would appreciate your feedback on this. Thanks!