# fleet
d
Hello Fleet - has anyone here seen high rates of 400 errors logged in Fleet's nginx for POSTs to the logging endpoint? We have a lot of clients that are never able to send data to Fleet because every POST from them returns a 400, and we are wondering whether this is somehow related to, or the cause of, massive unexpected traffic from some of our clients to Fleet (as the clients keep retrying to resend the data). There's a write-up of the issue here: https://github.com/osquery/osquery/issues/7021#issuecomment-808569110
Also, are there any best practices for Fleet's nginx config? Maybe we need larger headers enabled or something
n
Hi Dan, this writeup includes guidance on Fleet and Nginx: https://defensivedepth.com/2020/04/02/kolide-fleet-breaking-out-the-osquery-api-web-ui/
In the osquery/osquery issue writeup you’ve linked, you confirm in your latest comment that Fleet is configured with --logger_tls_endpoint as /api/v1/log, not /api/v1/osquery/log. Is this correct?
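For reference, a minimal osquery flagfile sketch pointed at Fleet's documented TLS API paths; the hostname, certificate path, and secret path below are placeholders, not values taken from this thread:
```
# osquery flagfile sketch for a Fleet deployment (placeholder hostname and paths)
--tls_hostname=fleet.example.com:443
--tls_server_certs=/etc/osquery/fleet.pem
--enroll_secret_path=/etc/osquery/enroll_secret
--enroll_tls_endpoint=/api/v1/osquery/enroll
--config_plugin=tls
--config_tls_endpoint=/api/v1/osquery/config
--logger_plugin=tls
--logger_tls_endpoint=/api/v1/osquery/log
--disable_distributed=false
--distributed_plugin=tls
--distributed_tls_read_endpoint=/api/v1/osquery/distributed/read
--distributed_tls_write_endpoint=/api/v1/osquery/distributed/write
```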
d
oh...let me confirm that. thanks for the link
sorry, it's /api/v1/osquery/log
i'll fix that in the comment, good catch
Ya, I've read that link before, and we plan to do that so that we can have the TLS read/write endpoints for our laptops exposed over the internet without allowing external access to the administrative APIs (a rough nginx sketch of that split is included after this message block)
Also, I could have sworn Fleet came with its own nginx...but maybe I'm wrong about that
@Noah Talerman - do you know if osquery will continue to try to send logs to Fleet forever if it's not getting a 200 response? I see max attempts for config (--config_tls_max_attempts=3) and max attempts for distributed (--distributed_tls_max_attempts=3), but I don't see anything that seems to control how long clients will retry sending to Fleet. If that doesn't exist, then it supports our theory that our clients are retrying over and over and over
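A rough nginx sketch of the split mentioned above (public osquery TLS endpoints, internal-only UI and admin API), assuming Fleet listens locally on port 8080; the hostname, certificate paths, and network range are placeholders:
```
server {
    listen 443 ssl;
    server_name fleet.example.com;                  # placeholder
    ssl_certificate     /etc/nginx/tls/fleet.crt;   # placeholder
    ssl_certificate_key /etc/nginx/tls/fleet.key;   # placeholder

    # osquery enroll/config/log/distributed endpoints: reachable from anywhere
    location /api/v1/osquery/ {
        proxy_pass https://127.0.0.1:8080;
        proxy_set_header Host $host;
    }

    # everything else (web UI, administrative API): internal ranges only
    location / {
        allow 10.0.0.0/8;                           # example internal range
        deny all;
        proxy_pass https://127.0.0.1:8080;
        proxy_set_header Host $host;
    }
}
```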
n
do you know if osquery will continue to try to send logs to Fleet forever if it’s not getting a 200 response?
I don’t have an immediate answer to this question. Working on getting an answer now.
d
thanks!
n
When the logger endpoint responds with a 400 status, the logs that osquery attempted to send are buffered on the client. The --logger_tls_period option determines the number of seconds before osquery checks for buffered logs, so the client will attempt to send them to Fleet again and again at that interval
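A short sketch of the osquery flags that govern this buffer-and-retry behavior; the values here are illustrative examples, not recommendations:
```
--logger_tls_period=10      # seconds between attempts to flush buffered logs to the logger endpoint
--buffered_log_max=500000   # cap on how many result/status log lines osquery keeps buffered locally
--logger_tls_compress=true  # gzip the log POST bodies to reduce bandwidth
```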
d
Right, that matches my understanding. So there is no setting for how many times to try before giving up...it will always keep trying
at least until the DB is cleared out
i.e. the buffered logs
n
it will always keep trying
Correct
I now see how the client trying over and over again is undesirable with your setup. What was your ideal next step when you encountered this issue? Was it to find a way to inform these osquery clients to give up?
d
Sorry @Noah Talerman, went and got lunch. It's not so much that it's undesirable; I just wanted to confirm the behavior. We are looking at everything we possibly can to try and figure out why we have clients sending GB of data every hour but hardly any of it makes it to Fleet
As I'm looking through /var/log/messages on Fleet, I see some EOF errors, like this:
2021-03-29T18:42:11.682014+00:00 servername REDACTED fleet[11900]: {"component":"http","err":"decoding JSON: unexpected EOF","ts":"2021-03-29T18:42:11.68104674Z"}
I'm also seeing a lot of issues with invalid node keys, missing node keys, and logs where we have clients 'enrolling too often'
We need to upgrade this env to 3.9.0 to see if that might clear up some of the duplicate / invalid node keys
Turned on debug logging in Fleet and I'm not seeing anything in /var/log/messages that corresponds to the 400s I see in the nginx access logs. So that's a data point for the 400s being generated by nginx and not by Fleet upstream
👍 1
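One way to back up that data point, assuming the nginx config can be adjusted: log the upstream status alongside the client-facing status. $upstream_status is typically logged as "-" when nginx rejected the request without ever proxying it to Fleet. A sketch:
```
# in the http {} block
log_format fleet_debug '$remote_addr [$time_local] "$request" '
                       'status=$status upstream=$upstream_status '
                       'req_len=$request_length body_sent=$body_bytes_sent';

access_log /var/log/nginx/fleet_access.log fleet_debug;
```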
n
we are looking at everything we possibly can to try and figure out why we have clients sending GB of data every hour but hardly any of it makes it to Fleet
Got it. Thank you for providing your updated findings. The configurable host identifier included in 3.9.0 may be helpful for the duplicate enrollment. I'm attempting to get a better answer for why you’re seeing the EOF errors.
d
awesome, thanks
s
You were thinking about nginx logs? Any chance you’ve looked at them?
"decoding JSON: unexpected EOF" sounds like nginx is truncating
d
Yes, we have been looking at them @seph. Mostly we are only getting 200s and 400s; we aren't getting any error logs, so I have my team looking at that
My first thought was that those EOFs could be the cause of the 400s, but we only see a small-ish number of EOFs, far fewer than the 400s we see.
@Noah Talerman @zwass - good news. Reduced buffered_log_max significantly, and within an hour the bandwidth usage dropped an order of magnitude. Additionally, we are no longer seeing significant numbers of 400 errors in nginx, which now leads me to believe that the message body size nginx accepts needs to be increased. Cautiously optimistic that we are going to be able to resolve this soon.
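If the request-size theory holds, these are the nginx directives usually involved (set in the http or server block); the values are illustrative only. Note that an oversized body is normally rejected with a 413 while oversized request headers come back as a 400, so it may be worth checking which limit is actually being hit:
```
client_max_body_size 20m;            # max accepted request body; exceeding it returns 413 by default
large_client_header_buffers 4 16k;   # buffers for large request headers; overflow is rejected with a 400
client_body_buffer_size 1m;          # in-memory buffer before the body is spooled to a temp file
```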
z
Glad to hear it!
🙏 1