# fleet
d
Hello Fleet - has anyone here seen high rates of 400 errors logged in Fleet's nginx for POSTs to the logging endpoint? We have a lot of clients that are never able to send data to Fleet because every POST from them returns a 400, and we are wondering whether this is somehow related to, or the cause of, massive unexpected traffic from some of our clients to Fleet (as the clients keep retrying to resend the data). There's a write-up of the issue here: https://github.com/osquery/osquery/issues/7021#issuecomment-808569110
Also, are there any best practices for Fleet's nginx config? Maybe we need larger headers enabled or something
n
Hi Dan, this writeup includes guidance on Fleet and Nginx: https://defensivedepth.com/2020/04/02/kolide-fleet-breaking-out-the-osquery-api-web-ui/
In the osquery/osquery issue writeup you’ve linked, you confirm in your latest comment that Fleet is configured with --logger_tls_endpoint as /api/v1/log, not /api/v1/osquery/log. Is this correct?
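For reference, a minimal osquery flagfile sketch pointed at Fleet's documented TLS API paths; the hostname, certificate path, and secret path below are placeholders, not values taken from this thread:
```
# osquery flagfile sketch for a Fleet deployment (placeholder hostname and paths)
--tls_hostname=fleet.example.com:443
--tls_server_certs=/etc/osquery/fleet.pem
--enroll_secret_path=/etc/osquery/enroll_secret
--enroll_tls_endpoint=/api/v1/osquery/enroll
--config_plugin=tls
--config_tls_endpoint=/api/v1/osquery/config
--logger_plugin=tls
--logger_tls_endpoint=/api/v1/osquery/log
--disable_distributed=false
--distributed_plugin=tls
--distributed_tls_read_endpoint=/api/v1/osquery/distributed/read
--distributed_tls_write_endpoint=/api/v1/osquery/distributed/write
```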
d
oh...let me confirm that. thanks for the link
sorry, it's /api/v1/osquery/log
i'll fix that in the comment, good catch
Ya, I've read that link before, and we plan to do that so that we can have the TLS read/write endpoints for our laptops exposed over the internet without allowing external access to the administrative APIs (a rough nginx sketch of that split is included after this message block)
Also, I could have sworn Fleet came with its own nginx...but maybe I'm wrong about that
@Noah Talerman - do you know if osquery will continue to try to send logs to Fleet forever if it's not getting a 200 response? I see max attempts for config (--config_tls_max_attempts=3) and max attempts for distributed (--distributed_tls_max_attempts=3), but I don't see anything that seems to control how long clients will retry sending to Fleet. If that doesn't exist, then it supports our theory that our clients are retrying over and over and over
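A rough nginx sketch of the split mentioned above (public osquery TLS endpoints, internal-only UI and admin API), assuming Fleet listens locally on port 8080; the hostname, certificate paths, and network range are placeholders:
```
server {
    listen 443 ssl;
    server_name fleet.example.com;                  # placeholder
    ssl_certificate     /etc/nginx/tls/fleet.crt;   # placeholder
    ssl_certificate_key /etc/nginx/tls/fleet.key;   # placeholder

    # osquery enroll/config/log/distributed endpoints: reachable from anywhere
    location /api/v1/osquery/ {
        proxy_pass https://127.0.0.1:8080;
        proxy_set_header Host $host;
    }

    # everything else (web UI, administrative API): internal ranges only
    location / {
        allow 10.0.0.0/8;                           # example internal range
        deny all;
        proxy_pass https://127.0.0.1:8080;
        proxy_set_header Host $host;
    }
}
```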
n
do you know if osquery will continue to try to send logs to Fleet forever if it’s not getting a 200 response?
I don’t have an immediate answer to this question. Working on getting an answer now.
d
thanks!
n
When the logger endpoint responds with a 400 status, the logs that osquery attempted to send are buffered on the client. The --logger_tls_period option determines the number of seconds before osquery checks for buffered logs, so the client will attempt to send them to Fleet again and again at that interval
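A short sketch of the osquery flags that govern this buffer-and-retry behavior; the values here are illustrative examples, not recommendations:
```
--logger_tls_period=10      # seconds between attempts to flush buffered logs to the logger endpoint
--buffered_log_max=500000   # cap on how many result/status log lines osquery keeps buffered locally
--logger_tls_compress=true  # gzip the log POST bodies to reduce bandwidth
```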
d
Right, that matches my understanding. So there is no setting for how many times to try before giving up...it will always keep trying
at least until the DB is cleared out
i.e. the buffered logs
n
it will always keep trying
Correct
I now see how the client trying over and over again is undesirable with your setup. What was your ideal next step when you encountered this issue? Was it to find a way to inform these osquery clients to give up?
d
Sorry @Noah Talerman, went and got lunch. It's not so much that it's undesirable; I just wanted to confirm the behavior. We are looking at everything we possibly can to try and figure out why we have clients sending GB of data every hour but hardly any of it makes it to Fleet
As I'm looking through /var/log/messages on Fleet, I see some EOF errors, like this:
2021-03-29T18:42:11.682014+00:00 servername REDACTED fleet[11900]: {"component":"http","err":"decoding JSON: unexpected EOF","ts":"2021-03-29T18:42:11.68104674Z"}
I'm also seeing a lot of issues with invalid node keys, missing node keys, and logs where we have clients 'enrolling too often'
We need to upgrade this env to 3.9.0 to see if that might clear up some of the duplicate / invalid node keys
Turned on debug logging in Fleet and I'm not seeing anything in /var/log/messages that corresponds to the 400s I see in the nginx access logs. So that's a data point for the 400s being generated by nginx and not by Fleet upstream
👍 1
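One way to back up that data point, assuming the nginx config can be adjusted: log the upstream status alongside the client-facing status. $upstream_status is typically logged as "-" when nginx rejected the request without ever proxying it to Fleet. A sketch:
```
# in the http {} block
log_format fleet_debug '$remote_addr [$time_local] "$request" '
                       'status=$status upstream=$upstream_status '
                       'req_len=$request_length body_sent=$body_bytes_sent';

access_log /var/log/nginx/fleet_access.log fleet_debug;
```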
n
we are looking at everything we possibly can to try and figure out why we have clients sending GB of data every hour but hardly any of it makes it to Fleet
Got it. Thank you for providing your updated findings. The configurable host identifier included in 3.9.0 may be helpful for the duplicate enrollment. I'm attempting to get a better answer for why you’re seeing the EOF errors.
d
awesome, thanks
s
You were thinking about nginx logs? Any chance you’ve looked at them?
"decoding JSON: unexpected EOF" sounds like nginx is truncating
d
Yes, we have been looking at them @seph. Mostly we are only getting 200s and 400s; we aren't getting any error logs, so I have my team looking at that
My first thought was that those EOFs could be the cause of the 400s, but we only see a small-ish number of EOFs, far fewer than the 400s we see.
@Noah Talerman @zwass - good news. Reduced buffered_log_max significantly, and within an hour the bandwidth usage dropped an order of magnitude. Additionally, we are no longer seeing significant numbers of 400 errors in nginx, which now leads me to believe that the message body size nginx accepts needs to be increased. Cautiously optimistic that we are going to be able to resolve this soon.
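If the request-size theory holds, these are the nginx directives usually involved (set in the http or server block); the values are illustrative only. Note that an oversized body is normally rejected with a 413 while oversized request headers come back as a 400, so it may be worth checking which limit is actually being hit:
```
client_max_body_size 20m;            # max accepted request body; exceeding it returns 413 by default
large_client_header_buffers 4 16k;   # buffers for large request headers; overflow is rejected with a 400
client_body_buffer_size 1m;          # in-memory buffer before the body is spooled to a temp file
```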
z
Glad to hear it!
🙏 1