# general
d
Hi everyone. We've noticed that our osquery clients are opening and tearing down connections each time they check in on the distributed interval. We have `tls_session_reuse: true`. Is that only for the logger perhaps? It's surprising to me that we have constant sessions up and down - this is currently 5000 per min and will go even more crazy as we scale.
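For reference, the two osquery flags in play for connection reuse (values shown are the documented defaults as best I recall; double-check against your osquery version):

```
--tls_session_reuse=true      # reuse the TLS session/connection between requests
--tls_session_timeout=3600    # seconds before a reused session is torn down
```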
j
we actually set it to false some time ago, as it seemed to give us the best results... I didn't see any real reuse of the sessions
👍 1
d
Thanks Juan.
I had planned to disable it and check to see what happens
z
@npamnani @Seshu Looks like this was an Uptycs-added feature. Any insight?
s
As far as I can see, it only stores a single HTTP client per thread, which is connected to only one endpoint; if different endpoints have to be contacted, the connection has to be closed and reopened.
s
It is supposed to help avoid connection churn. It also depends on how the receiving side is configured (protocol, keep-alive, duration, etc.). If you are doing a periodic distributed read, it doesn't help for that. It may be useful for the logger, assuming there is constant data (events etc.) being sent.
d
i see
@Stefano Bonicatti - we do have an NLB in front of Fleet that is deployed across 3 AWS AZs, so the connections will go across 3 IPs
we do have keepalives enabled: `keepalive_timeout 65; keepalive_requests 100;`
i assume osquery uses the system resolver. if it gets 3 IPs back from an nslookup, it doesn't try to connect to all 3 I hope?
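For context, a minimal sketch of where those keepalive directives sit in an nginx front end for Fleet; the upstream name, backend port, and the omitted TLS directives are placeholders, not the actual config:

```
upstream fleet {
    server 127.0.0.1:8080;       # placeholder Fleet backend
    keepalive 32;                # keep idle upstream connections to Fleet open
}

server {
    listen 443 ssl;              # cert/key directives omitted
    keepalive_timeout 65;        # osquery -> nginx keep-alive, as quoted above
    keepalive_requests 100;

    location / {
        proxy_http_version 1.1;
        proxy_set_header Connection "";   # needed for upstream keepalive
        proxy_pass http://fleet;
    }
}
```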
s
NLB forwarding to Nginx? We have identical keepalive configuration with Nginx. What is your distributed_interval? We use 0 and we see 2 connections from Osquery to Nginx. Every time config kicks in, it creates another temporary connection that gets closed.
d
@Seshu our distributed interval was 10 sec, we changed to 60 as a test to see if connections dropped (which they did). Yes, NLB --> nginx --> fleet
what is setting the dist interval to 0 supposed to accomplish? never check in?
s
"Seconds between polling for new queries (default 60)". So this is the sleep duration between sending distributed_read requests.
I might be wrong, but the timeout handler could be closing the connection after 16 seconds (hard-coded in tls.cpp). Which means keep-alive for distributed connections might not make any difference. Need to double check!!!
d
oh, does 0 mean it uses the default which is 60 seconds, or did you mean to type that your dist interval was set to 60? you have 0 above. that would be bad if it was hardcoded to 16 sec
s
It is kind'a confusing (I think), but it is the sleep duration between sending distributed requests. So we don't sleep between requests. By default, it sleeps for 60 seconds if you don't change the flag value.
```
./dispatcher/distributed_runner.cpp:      pause(std::chrono::seconds(FLAGS_distributed_interval));
```
You might want to run for a bit with `--tls_dump` and see what is going on...
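For example, something along these lines (the flag file path is a placeholder):

```
# run osqueryd with TLS request/response dumping and verbose logging
osqueryd --flagfile=/etc/osquery/osquery.flags --tls_dump --verbose
```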
d
oh, got it. thanks
ya, I've run with tls_dump before... things look pretty standard. I can see the check-ins on the interval, etc.
s
I usually look for the `Connection` header in request/response and also track `netstat` connection churn from Osquery.
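A rough way to watch that churn from the client side (assumes Linux and a TLS endpoint on 443; adjust to your setup):

```
# count client connections to the TLS endpoint, grouped by TCP state;
# a steady stream of TIME_WAIT sockets is a good indicator of churn
netstat -tn | grep ':443' | awk '{print $6}' | sort | uniq -c
```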
d
right
let me know what you find out about the timeout handler....if you are looking at that. if it's hardcoded for the distributed endpoints, nothing we do in nginx is going to make any difference
also, with distributed interval the docs say `--distributed_interval=60`: "In seconds, the amount of time that osqueryd will wait between periodically checking in with a distributed query server to see if there are any queries to execute." This is how I understood it. Your example above @Seshu says you set this to 0. That seems crazy - how can your clients check in without any interval between subsequent check-ins?
s
🙂 In our config, we don't pause. We always have Osquery waiting for queries. We can send a query to thousands of hosts and get back an answer in a second. If we pause, we will have to wait that amount of time to get responses back.
d
sure...that's why we set it to 10
🙂
ok, makes sense I guess
perhaps that's better from a connection standpoint then
we should try that
z
@Seshu how many RPS does a single osquery instance generate with the interval set to 0? Do you use a strategy in which the server waits to respond to the request until there is a query ready?
d
interesting to think about. ya, you'd get a lot of requests to the read api with that set to 0... as fast as it can do them? how would that even work?
s
Every distributed query waits for 12-odd seconds before the server sends back an empty response. If one or more queries come through, we immediately send them to Osquery on the waiting socket. Osquery immediately sends another request without delay once it gets a response from the back-end. With keep-alive, it will be on the same connection. Like I mentioned above, it usually is 2 connections per host (1 for DR, 1 for Log). And an occasional config connection.
👍 1
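For illustration, a minimal sketch of that hold-the-request idea. This is purely illustrative C++ (not Uptycs or Fleet code); the ~12 second deadline is taken from the description above:

```
// Hold a distributed_read until a query is queued or the deadline passes.
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

std::mutex m;
std::condition_variable cv;
std::queue<std::string> pending;  // queries queued for this host

// Server-side handler: block until a query arrives or ~12 s elapse.
std::string handleDistributedRead() {
  std::unique_lock<std::mutex> lock(m);
  bool got = cv.wait_for(lock, std::chrono::seconds(12),
                         [] { return !pending.empty(); });
  if (!got) {
    return "{}";  // empty response; osquery immediately re-polls
  }
  std::string query = pending.front();
  pending.pop();
  return query;
}

int main() {
  // Simulate an operator queueing a query 2 seconds into the poll.
  std::thread operator_thread([] {
    std::this_thread::sleep_for(std::chrono::seconds(2));
    std::lock_guard<std::mutex> lock(m);
    pending.push("SELECT * FROM osquery_info;");
    cv.notify_one();
  });

  std::cout << "response: " << handleDistributedRead() << std::endl;
  operator_thread.join();
}
```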
z
Ah, interesting strategy! Thanks for sharing 🙂
d
ya, that's slick. I like it
z
How does your agent behave when offline? Does it use a lot of CPU spamming distributed query connection attempts? Or is this something you changed in your agent?
s
It does exponential back-off...
z
Is that osquery functionality or something you changed?
s
I need to check what is in OSS and if it is same as what we have .. 😞
z
Ha no worries, I can check osquery. Thanks for the insights!
👍 1
s
(shared a link to the relevant retry/back-off code in osquery)
z
Ah nice, thanks Stefano. Makes sense we have that for all requests.
d
how many failed attempts until it starts backing off? Considering that our clients (currently) can only communicate to Fleet when on an end-user initiated VPN connection, we definitely don't have persistent connectivity. I don't think we have any problems when `distributed_interval` is set to 10 seconds, so I'm guessing it won't be an issue with 0. I'll have to test this out.
s
Hum so, the backoff happens immediately at the first failure, but the caveat is that the retry attempts for that specific request are configured by a flag that depends on where the request originated from. In the case of the distributed plugin, the default is 3 attempts total (`distributed_tls_max_attempts`), which means 1 immediate, then if the previous failed, 1 after 1 second, 1 after 4 seconds. After this, the `distributed_interval` kicks in.
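In flag terms (values illustrative; the roughly +1 s and +4 s retry spacing is per the description above):

```
--distributed_tls_max_attempts=3   # total attempts per distributed request (the default)
--distributed_interval=10          # poll interval that kicks in after those attempts
```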
d
ah, thanks @Stefano Bonicatti. I should have realized as I'm familiar with the max attempts setting
z
@Dan Achin with Fleet you'll get a response to a distributed request pretty much immediately so you're going to have osquery making a lot of requests if you set the interval to 0.
d
Hmm, it sounds to me like that would actually be less though based on the info above: https://osquery.slack.com/archives/C08V7KTJB/p1619218588257100?thread_ts=1619213725.247500&cid=C08V7KTJB What am I missing?
well, one every 12 seconds anyways. but a lot less connections
z
Fleet doesn't wait to return an empty response.
d
oh...was he talking about something other than Fleet? i guess i assumed as I thought I was in the Fleet channel. 🙂
so then ya...might not be the best idea for us.
z
He was talking about the Uptycs commercial project which uses a fork of osquery.
d
K, thanks. Looks like we'll need to figure out why we see so many connections from clients to Fleet. The last count I had was around 300 per system and every check-in to the read endpoint is generating a new one.
z
I wonder if this could be a bug in the `tls_session_timeout` implementation? Would be good to file an issue with osquery if you can come up with a repro.
🙏 1
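If it comes to filing an issue, a repro could be as simple as running with the reuse flags set and counting connections over time (flag values, file path, and port are illustrative):

```
# baseline: session reuse on, generous session timeout
osqueryd --flagfile=/etc/osquery/osquery.flags \
  --tls_session_reuse=true --tls_session_timeout=3600 \
  --distributed_interval=10 --verbose

# in another shell, watch whether new connections keep getting created anyway
watch -n 10 "netstat -tn | grep -c ':443'"
```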
d
ah...i should have been looking at that setting. I could try setting it to 0 and allowing nginx to manage the sessions
let me see what I can figure out
@Seshu - were you ever able to confirm this?
```
I might be wrong, but the timeout handler could be closing the connection after 16 seconds (hard-coded in tls.cpp). Which means keep-alive for distributed connections might not make any difference. Need to double check!!!
```
I've been tweaking `tls_session_timeout` on our clients and I'm not seeing any difference in the number of connections opened/closed from that. The only thing I'm seeing that reduces connection churn from clients is increasing our `distributed_interval`. I'm thinking I'll open an osquery issue, but wondered if you confirmed a hard-coded timeout in tls.cpp.
s
Sorry. Haven't looked into it @Dan Achin. Don't have the OSS Osquery setup handy. Will try to get to it...
d
ok, thanks. let me know if you get to it, would appreciate it. I'll open a git issue today at some point.
s
Thanks. Will look into this tonight...
🙏 1
@Stefano Bonicatti the linked condition at http_client.cpp:339 is causing the socket to close all the time. At least I see that all log requests are hitting that code path. Both `new_client_options_` and `client_options_.ssl_connection_ != ssl_connection` are `true` for the TLS logger.
🙏 1
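To make that concrete, here is a toy illustration of the pattern being described: if the 'options changed' check stays true on every request, each log send tears down and reopens the connection. This is a simplification, not the actual osquery http_client.cpp source:

```
#include <iostream>

struct Options {
  bool ssl_connection = true;
};

struct Client {
  Options client_options_;
  bool new_client_options_ = true;  // never cleared in this toy => every request reconnects
  bool connected_ = false;

  void request(bool ssl_connection) {
    // Paraphrase of the condition discussed above: options look "new", or the
    // requested TLS mode differs from the one the current socket was opened with.
    if (new_client_options_ ||
        client_options_.ssl_connection != ssl_connection) {
      if (connected_) {
        std::cout << "closing existing socket\n";
      }
      connected_ = false;
      client_options_.ssl_connection = ssl_connection;
    }
    if (!connected_) {
      std::cout << "opening new connection\n";
      connected_ = true;
    }
    std::cout << "sending request\n";
  }
};

int main() {
  Client c;
  // Identical options on every call, yet each request reconnects because the
  // "new options" flag stays set - the kind of churn described for the TLS logger.
  for (int i = 0; i < 3; ++i) {
    c.request(true);
  }
}
```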
d
@Stefano Bonicatti - would appreciate your feedback on this. Thanks!