#general

Dan Achin

04/23/2021, 9:35 PM
Hi everyone. We've noticed that our osquery clients are opening and tearing down connections each time they check in on the distributed interval. We have tls_session_reuse: true. Is that only for the logger perhaps? It's surprising to me that we have constant sessions up and down - this is currently 5000 per min and will go even more crazy as we scale.

Juan Alvarez

04/23/2021, 9:36 PM
We actually set it to false some time ago, as it seemed to give us the best results... I didn't see any real reuse of the sessions.

Dan Achin

04/23/2021, 9:36 PM
Thanks Juan.
9:37 PM
I had planned to disable it and check to see what happens

zwass

04/23/2021, 9:39 PM
@npamnani @Seshu Looks like this was an uptycs added feature. Any insight?

Stefano Bonicatti

04/23/2021, 9:45 PM
As far as I can see it seems that it only stores a single http client per thread, which is connected to only one endpoint, and if there are different endpoints to be contacted, then the connection has to be closed and reopened
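A minimal sketch of the behavior described above, assuming a single cached client that can only point at one endpoint at a time. The class and the endpoint names are hypothetical and are not taken from the osquery source; the point is only that alternating endpoints defeats reuse.

```python
class CachedClient:
    """Hypothetical stand-in for a per-thread HTTP client holding one connection."""

    def __init__(self):
        self.endpoint = None  # endpoint the cached connection currently points at
        self.connects = 0     # how many times a new connection had to be opened

    def request(self, endpoint):
        # The cached connection is reused only when the endpoint is unchanged;
        # otherwise it is dropped and a new one is opened.
        if endpoint != self.endpoint:
            self.endpoint = endpoint
            self.connects += 1
        return self.connects


def count_connects(endpoints):
    """Replay a sequence of requests and return how many connections were opened."""
    client = CachedClient()
    for endpoint in endpoints:
        client.request(endpoint)
    return client.connects


# Placeholder endpoint names, for illustration only.
print(count_connects(["endpoint-a"] * 4))                # one endpoint: 1 connect
print(count_connects(["endpoint-a", "endpoint-b"] * 2))  # alternating: 4 connects
```

If the real client behaves this way, any thread that talks to more than one endpoint would reconnect on every switch, regardless of server-side keep-alive settings.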

Seshu

04/23/2021, 9:49 PM
It is supposed to help avoid connection churn. It also depends on how the receiving side is configured (protocol, keep-alive, duration etc.). If you are doing periodic distributed reads, it doesn't help for that. It may be useful for the logger, assuming there is constant data (events etc.) being sent.

Dan Achin

04/23/2021, 10:05 PM
i see
10:06 PM
@Stefano Bonicatti - we do have an NLB in front of Fleet that is deployed across 3 AWS AZs, so the connections will go across 3 IPs
10:06 PM
we do have keepalives enabled
10:06 PM
keepalive_timeout  65;  keepalive_requests 100;
10:09 PM
i assume osquery uses the system resolver. if it gets 3 IPs back from an nslookup, it doesn't try to connect to all 3 I hope?

Seshu

04/23/2021, 10:13 PM
NLB forwarding to Nginx? We have identical keepalive configuration with Nginx. What is your distributed_interval? We use 0 and we see 2 connections from Osquery to Nginx. Every time config kicks in, it creates another temporary connection that gets closed.

Dan Achin

04/23/2021, 10:21 PM
@Seshu our distributed interval was 10 sec, we changed to 60 as a test to see if connections dropped (which they did). Yes, NLB --> nginx --> fleet
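A rough back-of-the-envelope sketch of why the interval change reduces connections, assuming every check-in opens a fresh connection (which is what the thread reports seeing). The fleet size below is a hypothetical number, not one given in the conversation.

```python
def connections_per_minute(hosts, interval_seconds):
    """New connections per minute if each check-in opens its own connection.

    Assumes interval_seconds > 0 and that hosts check in evenly over time.
    """
    return hosts * 60 / interval_seconds


hosts = 1000  # hypothetical fleet size, for illustration only
for interval in (10, 60):
    print(interval, connections_per_minute(hosts, interval))
```

Under that assumption the connection rate scales inversely with the interval, so going from 10 to 60 seconds would cut it to one sixth.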
10:22 PM
what is setting the dist interval to 0 supposed to accomplish? never check in?

Seshu

04/23/2021, 10:23 PM
"Seconds between polling for new queries (default 60)". So this is the sleep duration between sending distributed_read requests.
10:27 PM
I might be wrong, but the timeout handler could be closing the connection after 16 seconds (hand-coded in tls.cpp). Which means keep-alive for distributed connections might not make any difference. Need to double check!!!

Dan Achin

04/23/2021, 10:29 PM
oh, does 0 mean it uses the default which is 60 seconds, or did you mean to type that your dist interval was set to 60? you have 0 above. that would be bad if it was hardcoded to 16 sec

Seshu

04/23/2021, 10:31 PM
It is kinda confusing (I think), but it is the sleep duration between sending distributed requests. So with 0, we don't sleep between requests. By default, it sleeps for 60 seconds if you don't change the flag value.
./dispatcher/distributed_runner.cpp:      pause(std::chrono::seconds(FLAGS_distributed_interval));
10:33 PM
You might want to run for a bit with --tls_dump and see what is going on...

Dan Achin

04/23/2021, 10:35 PM
oh, got it. thanks
10:35 PM
ya, I've run with tls_dump before... things look pretty standard. I can see the check-ins on the interval, etc

Seshu

04/23/2021, 10:37 PM
I usually look for Connection header in request/response and also track netstat connection churn from Osquery.
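One way to track that churn is to count TCP states per remote port from saved netstat output. The sketch below assumes the common Linux `netstat -an` column layout (proto, recv-q, send-q, local address, foreign address, state); the sample lines and addresses are made up for illustration.

```python
from collections import Counter


def count_states(netstat_output, remote_port):
    """Count TCP connection states for connections to the given remote port."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, non-TCP sockets, and anything without a state column.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        foreign, state = fields[4], fields[5]
        if foreign.endswith(f":{remote_port}"):
            counts[state] += 1
    return counts


# Made-up sample in the assumed column layout.
SAMPLE = """\
tcp        0      0 10.0.0.5:51234     10.0.1.10:443      ESTABLISHED
tcp        0      0 10.0.0.5:51230     10.0.1.11:443      TIME_WAIT
tcp        0      0 10.0.0.5:51228     10.0.1.12:443      TIME_WAIT
tcp        0      0 10.0.0.5:22        10.0.0.9:50022     ESTABLISHED
"""

print(count_states(SAMPLE, 443))
```

Running this against snapshots taken a few seconds apart should show whether the set of established connections is stable or whether closed connections keep piling up between check-ins.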

Dan Achin

04/23/2021, 10:37 PM
right
10:38 PM
let me know what you find out about the timeout handler....if you are looking at that. if it's hardcoded for the distributed endpoints, nothing we do in nginx is going to make any difference
10:42 PM
also, with distributed interval the docs say:
--distributed_interval=60
In seconds, the amount of time that osqueryd will wait between periodically checking in with a distributed query server to see if there are any queries to execute.
This is how I understood it. Your example above @Seshu says you set this to 0. That seems crazy. how can your clients check in without any interval between subsequent check-ins?

Seshu

04/23/2021, 10:45 PM
🙂 In our config, we don't pause. We always have Osquery waiting for queries. We can send a query to thousands of hosts and get back answer in a second. If we pause, we will have to wait that amount of time to get responses back.

Dan Achin

04/23/2021, 10:45 PM
sure...that's why we set it to 10
10:45 PM
🙂
10:45 PM
ok, makes sense I guess
10:46 PM
perhaps that's better from a connection standpoint then
10:46 PM
we should try that

zwass

04/23/2021, 10:47 PM
@Seshu how many RPS does a single osquery instance generate with the interval set to 0? Do you use a strategy in which the server waits to respond to the request until there is a query ready?

Dan Achin

04/23/2021, 10:49 PM
interesting to think about. ya, you'd get a lot of requests to the read api with that set to 0...as fast as it can do them? how would that even work?

Seshu

04/23/2021, 10:56 PM
Every distributed query waits for 12 odd seconds before the server sends back empty response. If one or more queries come through, we immediately send them to Osquery on the waiting socket. Osquery immediately sends another request without delay once it gets response from back-end. With keep-alive, it will be on the same connection. Like I mentioned above, it usually is 2 connections per host (1 for DR, 1 for Log). And an occasional config connection.
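The server-side half of that strategy is essentially long polling. A minimal sketch, assuming an in-memory queue of pending queries per host: the function name, the queue, and the response shape are illustrative assumptions, not the actual backend described above.

```python
import queue


def distributed_read(pending, hold_seconds=12.0):
    """Hold the request until a query arrives or the hold time expires.

    `pending` is a queue.Queue of (name, sql) tuples for one host.
    """
    try:
        name, sql = pending.get(timeout=hold_seconds)
    except queue.Empty:
        # Nothing arrived within the hold window: send back an empty response.
        return {"queries": {}}
    # A query arrived: answer immediately on the waiting request.
    return {"queries": {name: sql}}


pending = queue.Queue()
pending.put(("uptime", "SELECT * FROM uptime;"))
print(distributed_read(pending, hold_seconds=0.1))  # returns the queued query
print(distributed_read(pending, hold_seconds=0.1))  # times out, empty response
```

With a client interval of 0, each host then makes roughly one request per hold window while idle, and a server that answers empty reads immediately would instead be hit as fast as the client can loop.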

zwass

04/23/2021, 10:56 PM
Ah, interesting strategy! Thanks for sharing 🙂

Dan Achin

04/23/2021, 10:58 PM
ya, that's slick. I like it

zwass

04/23/2021, 11:00 PM
How does your agent behave when offline? Does it use a lot of CPU spamming distributed query connection attempts? Or is this something you changed in your agent?

Seshu

04/23/2021, 11:04 PM
It does exponential back-off...

zwass

04/23/2021, 11:05 PM
Is that osquery functionality or something you changed?

Seshu

04/23/2021, 11:06 PM
I need to check what is in OSS and if it is same as what we have .. 😞

zwass

04/23/2021, 11:07 PM
Ha no worries, I can check osquery. Thanks for the insights!

zwass

04/23/2021, 11:13 PM
Ah nice, thanks Stefano. Makes sense we have that for all requests.

Dan Achin

04/23/2021, 11:13 PM
how many failed attempts until it starts backing off? Considering that our clients (currently) can only communicate to Fleet when on an end-user initiated VPN connection, we definitely don't have persistent connectivity. I don't think we have any problems when distributed_interval is set to 10 seconds, so I'm guessing it won't be an issue with 0. I'll have to test this out.

Stefano Bonicatti

04/23/2021, 11:22 PM
Hmm, so the backoff happens immediately at the first failure, but the caveat is that the retry attempts for that specific request are configured by a flag that depends on where the request originated from. In the case of the distributed plugin, the default is 3 attempts total (distributed_tls_max_attempts), which means: 1 immediate, then if the previous one failed, 1 after 1 second, then 1 after 4 seconds. After this, the distributed_interval kicks in.
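The delays quoted above (0, 1 and 4 seconds) are consistent with waiting the square of the attempt index before each try. The sketch below assumes that rule; the function names are hypothetical and this is not the osquery implementation.

```python
import time


def retry_delays(max_attempts):
    """Seconds waited before each attempt: 0 before the first, then 1, 4, ..."""
    return [attempt * attempt for attempt in range(max_attempts)]


def send_with_retries(send, max_attempts=3, sleep=time.sleep):
    """Call `send` until it succeeds or the attempts run out.

    `send` is any callable returning True on success; `sleep` is injectable
    so the schedule can be inspected without actually waiting.
    """
    for delay in retry_delays(max_attempts):
        sleep(delay)
        if send():
            return True
    # All attempts failed; the caller falls back to its normal interval.
    return False


print(retry_delays(3))
```

With the default of 3 attempts, an unreachable server costs about 5 seconds of waiting per check-in under this assumption before the regular interval takes over.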

Dan Achin

04/23/2021, 11:24 PM
ah, thanks @Stefano Bonicatti. I should have realized as I'm familiar with the max attempts setting

zwass

04/23/2021, 11:26 PM
@Dan Achin with Fleet you'll get a response to a distributed request pretty much immediately so you're going to have osquery making a lot of requests if you set the interval to 0.

Dan Achin

04/23/2021, 11:29 PM
Hmm, it sounds to me like that would actually be less though based on the info above: https://osquery.slack.com/archives/C08V7KTJB/p1619218588257100?thread_ts=1619213725.247500&cid=C08V7KTJB What am I missing?
11:29 PM
well, one every 12 seconds anyways. but a lot fewer connections

zwass

04/23/2021, 11:30 PM
Fleet doesn't wait to return an empty response.

Dan Achin

04/23/2021, 11:34 PM
oh...was he talking about something other than Fleet? i guess i assumed as I thought I was in the Fleet channel. 🙂
11:35 PM
so then ya...might not be the best idea for us.

zwass

04/23/2021, 11:36 PM
He was talking about the Uptycs commercial project which uses a fork of osquery.

Dan Achin

04/23/2021, 11:37 PM
K, thanks. Looks like we'll need to figure out why we see so many connections from clients to Fleet. The last count I had was around 300 per system and every check-in to the read endpoint is generating a new one.

zwass

04/23/2021, 11:41 PM
I wonder if this could be a bug in the tls_session_timeout implementation? Would be good to file an issue with osquery if you can come up with a repro.

Dan Achin

04/23/2021, 11:59 PM
ah...i should have been looking at that setting. I could try setting it to 0 and allowing nginx to manage the sessions
11:59 PM
let me see what I can figure out
4:59 PM
@Seshu - were you ever able to confirm this?
"I might be wrong, but the timeout handler could be closing the connection after 16 seconds (hand-coded in tls.cpp). Which means keep-alive for distributed connections might not make any difference. Need to double check!!!"
I've been tweaking tls_session_timeout on our clients and I'm not seeing any difference with #s of connections opened / closed from that. The only thing I'm seeing that reduces connection churn from clients is to increase our distributed_interval. I'm thinking I'll open an osquery issue, but wondered if you confirmed a hard-coded timeout in tls.cpp.

Seshu

04/27/2021, 5:01 PM
Sorry. Haven't looked into it @Dan Achin. Don't have the OSS Osquery setup handy. Will try to get to it...

Dan Achin

04/27/2021, 5:04 PM
ok, thanks. let me know if you get to it, would appreciate it. I'll open a git issue today at some point.

Seshu

04/28/2021, 12:31 AM
Thanks. Will look into this tonight...
9:26 PM
@Stefano Bonicatti link to http_client.cpp:339 is causing the socket to close all the time. At least I see that all log requests are hitting that code path. Both new_client_options_ and client_options_.ssl_connection_ != ssl_connection are true for the TLS logger.

Dan Achin

04/30/2021, 9:15 PM
@Stefano Bonicatti - would appreciate your feedback on this. Thanks!