#fleet
Dan Achin
03/29/2021, 5:17 PM
Hello Fleet - has anyone here seen high rates of 400 errors logged in Fleet's nginx for POSTs to the logging endpoint? We have a lot of clients that are never able to send data to Fleet because every POST from them returns a 400, and we are wondering if this is somehow related to, or the cause of, massive unexpected traffic from some of our clients to Fleet (as clients keep retrying to resend the data). There's a write-up of the issue here: https://github.com/osquery/osquery/issues/7021#issuecomment-808569110
5:34 PM
Also, are there any best practices for Fleet's nginx config? Maybe we need larger headers enabled or something.
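For reference, a minimal sketch of the kind of nginx reverse-proxy block people often put in front of Fleet, with larger header buffers than the nginx defaults. The hostname, certificate paths, listener port, and buffer sizes below are assumptions for illustration, not values from this thread:

```nginx
# Hypothetical reverse-proxy block in front of Fleet; names and sizes are assumptions.
server {
    listen 443 ssl;
    server_name fleet.example.com;                  # assumed hostname
    ssl_certificate     /etc/nginx/ssl/fleet.crt;   # assumed path
    ssl_certificate_key /etc/nginx/ssl/fleet.key;   # assumed path

    # Allow larger request headers than the nginx defaults.
    large_client_header_buffers 4 16k;              # assumed value

    location / {
        proxy_pass https://127.0.0.1:8080;          # assumed Fleet listener
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```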
Noah Talerman
03/29/2021, 5:38 PM
Hi Dan, this writeup includes guidance on Fleet and Nginx: https://defensivedepth.com/2020/04/02/kolide-fleet-breaking-out-the-osquery-api-web-ui/
5:41 PM
In the osquery/osquery issue writeup you’ve linked, you confirm in your latest comment that Fleet is configured with --logger_tls_endpoint as /api/v1/log, not /api/v1/osquery/log. Is this correct?
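For context, a minimal sketch of the TLS-related osquery flags for a Fleet deployment, showing the endpoint paths Fleet serves; the hostname and certificate path are assumptions:

```
# Hypothetical osquery flagfile for a Fleet deployment (hostname/cert path assumed).
--tls_hostname=fleet.example.com
--tls_server_certs=/etc/osquery/fleet.pem
--enroll_tls_endpoint=/api/v1/osquery/enroll
--config_plugin=tls
--config_tls_endpoint=/api/v1/osquery/config
--logger_plugin=tls
--logger_tls_endpoint=/api/v1/osquery/log
--disable_distributed=false
--distributed_plugin=tls
--distributed_tls_read_endpoint=/api/v1/osquery/distributed/read
--distributed_tls_write_endpoint=/api/v1/osquery/distributed/write
```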
Dan Achin
03/29/2021, 5:59 PM
oh...let me confirm that. thanks for the link
6:00 PM
sorry, it's /api/v1/osquery/log
6:00 PM
I'll fix that in the comment, good catch
6:01 PM
Yeah, I've read that link before, and we plan to do that so we can expose the TLS read/write endpoints for our laptops over the internet without allowing external access to the administrative APIs
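A sketch of how that split is often done in nginx, exposing only the osquery TLS API publicly while restricting the Fleet UI and administrative API; the listener address and internal CIDR are assumptions:

```nginx
# Hypothetical location split: osquery TLS endpoints public, everything else internal-only.
location /api/v1/osquery/ {
    proxy_pass https://127.0.0.1:8080;    # assumed Fleet listener
}
location / {
    allow 10.0.0.0/8;                     # assumed internal range
    deny  all;
    proxy_pass https://127.0.0.1:8080;    # assumed Fleet listener
}
```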
6:05 PM
Also, I could have sworn Fleet came with its own nginx... but maybe I'm wrong about that
6:21 PM
@Noah Talerman - do you know if osquery will continue trying to send logs to Fleet forever if it's not getting a 200 response? I see max attempts for config (--config_tls_max_attempts=3) and max attempts for distributed (--distributed_tls_max_attempts=3), but I don't see anything that seems to control how long clients will retry sending to Fleet. If that doesn't exist, then it supports our theory that our clients are retrying over and over and over
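A quick sketch of the retry-related flags mentioned here; to my knowledge osquery has no analogous give-up flag for the TLS logger, only interval and buffer controls (discussed later in the thread):

```
# Retry limits exist for config and distributed requests...
--config_tls_max_attempts=3
--distributed_tls_max_attempts=3
# ...but the TLS logger has no equivalent max-attempts flag; it keeps retrying
# buffered logs every --logger_tls_period seconds.
```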
Noah Talerman
03/29/2021, 7:20 PM
do you know if osquery will continue to try to send logs to Fleet forever if it’s not getting a 200 response?
I don’t have an immediate answer to this question. Working on getting an answer now.
Dan Achin
03/29/2021, 7:31 PM
thanks!
Noah Talerman
03/29/2021, 7:35 PM
When the logger endpoint responds with a 400 status, the logs that osquery attempted to send are buffered on the client. The --logger_tls_period option determines the number of seconds before the client checks for buffered logs again. So the osquery client will attempt to send logs to Fleet again and again at the frequency of this interval
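A sketch of the osquery flags that govern this send/buffer loop; the values shown are assumptions for illustration, not recommendations from this thread:

```
# How often (in seconds) osquery attempts to ship buffered results/status logs.
--logger_tls_period=10       # assumed value
# Maximum number of log lines to keep buffered on the client before purging the oldest.
--buffered_log_max=100000    # assumed value; osquery ships with a much larger default
```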
Dan Achin
03/29/2021, 7:36 PM
Right, that matches my understanding. So there is no setting for how many times to try before giving up... it will always keep trying
7:37 PM
at least until the DB is cleared out
7:37 PM
i.e. the buffered logs
Noah Talerman
03/29/2021, 7:42 PM
it will always keep trying
Correct
7:42 PM
I now see how the client trying over and over again is undesirable with your setup. What was your ideal next step when you encountered this issue? Was it to find a way to inform these osquery clients to give up?
Dan Achin
03/29/2021, 8:48 PM
Sorry @Noah Talerman, went and got lunch. It's not so much that it's undesirable; I just wanted to confirm the behavior. We are looking at everything we possibly can to try to figure out why we have clients sending GBs of data every hour when hardly any of it makes it to Fleet
8:53 PM
As I'm looking through /var/log/messages on Fleet, I see some EOF errors, like this:
2021-03-29T18:42:11.682014+00:00 servername REDACTED fleet[11900]: {"component":"http","err":"decoding JSON: unexpected EOF","ts":"2021-03-29T18:42:11.68104674Z"}
I'm also seeing a lot of issues with invalid node keys, missing node keys, and logs where we have clients 'enrolling too often'.
8:57 PM
We need to upgrade this env to 3.9.0 to see if that might clear up some of the duplicate / invalid node keys
10:01 PM
Turned on debug logging in Fleet and I'm not seeing anything in /var/log/messages that corresponds to the 400s I see in the nginx access logs. So that's a data point for the 400s being generated by nginx and not by Fleet upstream
👍 1
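For reference, a sketch of enabling Fleet's debug logging, assuming the YAML server config style Fleet supports; the same setting can typically be provided as an environment variable, and exact spelling may vary by Fleet version:

```yaml
# Hypothetical fragment of a Fleet server config file (assumed; check your Fleet
# version's configuration docs for the exact key / env var form).
logging:
  debug: true
```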
Noah Talerman
03/29/2021, 10:45 PM
we are looking at everything we possibly can to try and figure out why we have clients sending GB of data every hour but hardly any of it makes it to Fleet
Got it. Thank you for providing your updated findings. The configurable host identifier included in 3.9.0 may be helpful for duplicate enrollment. Attempting to get a better answer for why you’re seeing the EOF errors
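The setting Noah mentions pairs with osquery's --host_identifier flag on the agent side, which controls what identity each client reports. A sketch, with the chosen value purely illustrative; which identifier fits is deployment-specific:

```
# osquery agent-side flag; valid identifiers include uuid (the default), hostname,
# instance, and ephemeral. "hostname" is shown here only as an example.
--host_identifier=hostname
```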
Dan Achin
03/29/2021, 11:22 PM
awesome, thanks
seph
03/29/2021, 11:26 PM
You were thinking about nginx logs? Any chance you’ve looked at them?
11:26 PM
decoding JSON: unexpected EOF sounds like nginx is truncating
Dan Achin
03/29/2021, 11:37 PM
Yes... we have been looking at them @seph. Mostly we are only getting 200s and 400s... we aren't getting any error logs, so I have my team looking into that
11:38 PM
My first thought was that those EOFs could be the cause of the 400s... but we only see a smallish number of EOFs, far fewer than the 400s we see.
9:10 PM
@Noah Talerman @zwass - good news. Reduced buffered_log_max significantly, and within an hour the bandwidth usage dropped by an order of magnitude. Additionally, we are no longer seeing significant numbers of 400 errors in nginx, which now leads me to believe that the message body size nginx accepts needs to be increased. Cautiously optimistic that we are going to be able to resolve this soon.
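A sketch of the nginx knob being referred to here; the default client_max_body_size is 1m, and the raised value below is an assumption, not a recommendation from this thread:

```nginx
# Hypothetical fragment: raise the accepted request body size so large log POSTs
# from osquery aren't rejected (assumed value; set in the server or location block).
server {
    client_max_body_size 25m;
    # ... existing proxy configuration ...
}
```

Pairing a raised body limit on the nginx side with the reduced buffered_log_max on the osquery side keeps each retry payload bounded from both directions.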
zwass
03/30/2021, 9:25 PM
Glad to hear it!
🙏 1