# general
d
Hi guys, need help with an osquery setup. We have a problem with huge traffic to our infra when our endpoint /api/v1/logger fails for any reason. Are there settings or flags for controlling failed retry attempts? We saw 1.5 Gbit/sec of traffic from 5k machines in just one hour, and the load kept increasing. Thanks in advance.
Our /api/v1/logger sits behind two load balancers, and the first load balancer can't return an error to osquery, so the machines keep retrying their POST requests. That's the main reason for the huge traffic.
s
Hello @Denis, the logger mechanism is a bit different from the config or distributed ones: there is no separate retry mechanism. If a batch of logs fails to send, it is simply sent again in the next normal logger period. The period is controlled by --logger_tls_period.
"Period" is a bit of an unfortunate name, I think; it's really a fixed delay in between sends.
That being said, when the endpoint fails, do the clients receive a TCP connection error, or is the error via HTTP?
d
Hello @Stefano Bonicatti, thanks for the answer. I read the docs about logger_tls_period and the logger in general, and I didn't find a solution, which is why I asked here :) Clients receive the 50x error only after the full POST request has been sent.
s
When everything is working, what's the bandwidth used? If it's less than that, then you have to tune the TLS logger to limit how much it sends each period. You might be in a situation where the default only seems OK because the log queue is normally emptied fast enough, while the default max batch size is actually not OK for you.
d
> When everything is working, what's the bandwidth used?
About 10-20Mbit/sec
s
There's also --logger_tls_max_lines, which can help with the amount of data sent; by default it's 1024.
> About 10-20 Mbit/sec
I see, so I suspect you might be in the situation I mentioned, if you haven't already reduced that other parameter. My guess again is that you normally send about 1/100th of that, because that's the rate at which osquery generates logs; but when things don't work, logs accumulate in the DB, and each batch gets bigger because the batch max size hasn't been tuned.
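If that's the case, the batch cap can be lowered via the osquery flags, e.g. in a flagfile (256 here is an arbitrary illustration, not a recommendation):

```
--logger_tls_max_lines=256
```

A smaller cap bounds how much each client can send per period when draining a backlog, at the cost of taking longer to drain it.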
d
Yes, this is true. logger_tls_max_lines is at the default. Could you please advise a value of logger_tls_max_lines for our case, if possible? :)
Also, could you explain what one line of the TLS logger is? Is it just one response for one scheduled query?
s
> Could you please advise a value of logger_tls_max_lines for our case, if possible?

Well, that's not possible for me to say; you have to evaluate it for your deployment. I would start from the bandwidth limit you're ready to support and go from there. With the period and the maximum number of lines (plus the average size of each log line, which only you would know) you can calculate the theoretical average you would get if sending logs took no time beyond the fixed delay configured by the flag. You can also calculate the maximum, because there's a third flag, logger_tls_max_linesize, which controls the maximum line size and is 1 MiB by default. But this might make less sense: while it's true that each line could theoretically reach that size, 1 MiB is big (and unlikely, I believe).

Finally, be careful not to backlog your clients, especially if you increase the period and/or reduce the number of lines sent too much, because then the client's RocksDB database will slowly grow and become slower. There's a limit even there, buffered_log_max; if it's hit, the DB starts dropping old logs. It's quite high currently (1M entries). Keep in mind, though, that dropping logs might also cause further slowdowns, because the DB has to do work to drop them, so this mechanism is only OK if the threshold is hit only temporarily.
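To make that back-of-the-envelope calculation concrete, here's a small sketch in Python. The 1024-line batch is the default mentioned above; the 4-second period and the 500-byte average line size are assumed example values for illustration, not measurements from this thread:

```python
# Rough estimate of TLS logger upload bandwidth per client, assuming
# sends complete instantly so every period contributes one full batch.

def avg_bandwidth_bits_per_sec(period_s: float, max_lines: int,
                               avg_line_bytes: int) -> float:
    """Theoretical average upload rate for one osquery client."""
    batch_bytes = max_lines * avg_line_bytes
    return batch_bytes * 8 / period_s

# Assumed values: 4 s period, default 1024 max lines, and a
# hypothetical 500-byte average log line.
per_client = avg_bandwidth_bits_per_sec(period_s=4, max_lines=1024,
                                        avg_line_bytes=500)
fleet = per_client * 5000  # 5k machines, as in this thread

print(f"per client: {per_client / 1e6:.2f} Mbit/s")
print(f"fleet of 5k: {fleet / 1e9:.2f} Gbit/s")
```

Under these assumptions a fully backlogged fleet could in theory push several Gbit/s, which is in the same ballpark as the 1.5 Gbit/s observed here.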
> Also, could you explain what one line of the TLS logger is?
> Is it just one response for one scheduled query?

Admittedly this part is a bit confusing. A log line can be a status log line (the logs you also see if you launch osquery in the foreground on a terminal), or it can be results. For results, if differential mode is active for that query, a line is one added or removed row (so one line is one row that's different, be it added or removed; this is normal for events). For snapshot queries, instead, a line is the whole query result. Since there's such a huge range of sizes, I would tune the number of lines thinking more of the events/differential case, which is the one that can grow on that axis, even though the line size can indeed get big for snapshot queries.
Although for snapshot queries on non-evented tables, I would not expect lines to grow too much. For example, a query on the processes table can become bigger because there are more processes at that time, or maybe because the paths to the binaries are longer (but there would need to be a lot of them). That still requires a bit of knowledge of, and collaboration with, whoever is writing the queries.
d
Thank you so much for the explanation!
s
Sorry, I should correct myself, because I was thinking of the normal case, not the backlogged case. If you get backlogged, you can start accumulating snapshot results, and then you would send multiple "lines" which can be big. So while keeping a high number in logger_tls_max_lines is correct for events, it can be problematic for snapshot queries. In hindsight, osquery should have had a single flag with a bandwidth target to reach, with osquery deciding automatically how many "lines" to fit in the batch.
This might be something to open an issue about: if you have high-traffic machines sending a lot of events and a good number of snapshot queries, you have to keep that max_lines flag high to avoid backlogging the DB with event log lines, but at the same time, if you have network issues, osquery will start sending more and more snapshot lines, which are quite a bit bigger.
And I don't have a good solution right now; I realized this a bit late in the discussion.