# general
d
Hi guys, need help with an osquery setup. We have a problem with huge traffic to our infra when our endpoint /api/v1/logger fails for any reason. Are there settings or flags for controlling failed retry attempts? We saw 1.5 Gbit/sec of traffic from 5k machines in just one hour, and the load kept increasing. Thanks in advance.
Our /api/v1/logger sits behind two load balancers, and the first load balancer can't return an error to osquery, so the machines keep retrying their POST requests. That's the main reason for the huge traffic.
s
Hello @Denis, the logger mechanism is a bit different from the config or distributed ones: there is no separate retry mechanism. If a batch of logs fails to send, it is simply sent again in the next normal logger period. The period is controlled by --logger_tls_period.
"Period" is a bit of an unfortunate name, I think; it's really a fixed delay in between sends.
That being said, when the endpoint fails, do the clients receive a TCP connection error, or is the error via HTTP?
d
Hello @Stefano Bonicatti, thanks for the answer. I read the docs about logger_tls_period and the logger in general, and I didn't find a solution, which is why I asked here :) Clients receive the 50x error only after the full POST request has been sent.
s
When everything is working, what's the bandwidth used? If it's less than that, then you have to tune the TLS logger to limit how much it sends each period. You might be in a situation where the default only seems OK because the log queue is normally emptied fast enough, while the default max batch size is actually not OK for you.
d
> When everything is working, what's the bandwidth used?
About 10-20Mbit/sec
s
There's also --logger_tls_max_lines, which can help with the amount of data sent; by default it's 1024.
> About 10-20 Mbit/sec
I see, so I suspect you might be in the situation I mentioned, if you haven't already reduced that other parameter. My guess again is that you normally send about 1/100th of that, because that's the rate at which osquery generates logs; but when things don't work, logs accumulate in the DB, and each batch gets bigger because the batch max size hasn't been tuned.
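If that's the case, the batch cap can be lowered via the osquery flags, e.g. in a flagfile (256 here is an arbitrary illustration, not a recommendation):

```
--logger_tls_max_lines=256
```

A smaller cap bounds how much each client can send per period when draining a backlog, at the cost of taking longer to drain it.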
d
Yes, this is true. logger_tls_max_lines is at the default. Could you please advise a value of logger_tls_max_lines for our case, if possible? :)
Also, could you explain what one line of the TLS logger is? Is it just one response for one scheduled query?
s
> Could you please advise a value of logger_tls_max_lines for our case, if possible?

Well, that's not possible for me to say; you have to evaluate it for your deployment. I would start from the bandwidth limit you're ready to support and go from there. With the period and the maximum number of lines (plus the average size of each log line, which only you would know) you can calculate the theoretical average you would get if sending logs took no time beyond the fixed delay configured by the flag. You can also calculate the maximum, because there's a third flag, logger_tls_max_linesize, which controls the maximum line size and is 1 MiB by default. But this might make less sense: while it's true that each line could theoretically reach that size, 1 MiB is big (and unlikely, I believe).

Finally, be careful not to backlog your clients, especially if you increase the period and/or reduce the number of lines sent too much, because then the client's RocksDB database will slowly grow and become slower. There's a limit even there, buffered_log_max; if it's hit, the DB starts dropping old logs. It's quite high currently (1M entries). Keep in mind, though, that dropping logs might also cause further slowdowns, because the DB has to do work to drop them, so this mechanism is only OK if the threshold is hit only temporarily.
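To make that back-of-the-envelope calculation concrete, here's a small sketch in Python. The 1024-line batch is the default mentioned above; the 4-second period and the 500-byte average line size are assumed example values for illustration, not measurements from this thread:

```python
# Rough estimate of TLS logger upload bandwidth per client, assuming
# sends complete instantly so every period contributes one full batch.

def avg_bandwidth_bits_per_sec(period_s: float, max_lines: int,
                               avg_line_bytes: int) -> float:
    """Theoretical average upload rate for one osquery client."""
    batch_bytes = max_lines * avg_line_bytes
    return batch_bytes * 8 / period_s

# Assumed values: 4 s period, default 1024 max lines, and a
# hypothetical 500-byte average log line.
per_client = avg_bandwidth_bits_per_sec(period_s=4, max_lines=1024,
                                        avg_line_bytes=500)
fleet = per_client * 5000  # 5k machines, as in this thread

print(f"per client: {per_client / 1e6:.2f} Mbit/s")
print(f"fleet of 5k: {fleet / 1e9:.2f} Gbit/s")
```

Under these assumptions a fully backlogged fleet could in theory push several Gbit/s, which is in the same ballpark as the 1.5 Gbit/s observed here.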
> Also, could you explain what one line of the TLS logger is?
> Is it just one response for one scheduled query?

Admittedly this part is a bit confusing. A log line can be a status log line (the logs you also see if you launch osquery in the foreground on a terminal), or it can be results. For results, if differential mode is active for that query, a line is one added or removed row (so one line is one row that's different, be it added or removed; this is normal for events). For snapshot queries, instead, a line is the whole query result. Since there's such a huge range of sizes, I would tune the number of lines thinking more of the events/differential case, which is the one that can grow on that axis, even though the line size can indeed get big for snapshot queries.
Although for snapshot queries on non-evented tables, I would not expect lines to grow too much. For example, a query on the processes table can become bigger because there are more processes at that time, or maybe because the paths to the binaries are longer (but there would need to be a lot of them). That still requires a bit of knowledge of, and collaboration with, whoever is writing the queries.
d
Thank you so much for the explanation!
s
Sorry, I should correct myself, because I was thinking of the normal case, not the backlogged case. If you get backlogged, you can start accumulating snapshot results, and then you would send multiple "lines" which can be big. So while keeping a high number in logger_tls_max_lines is correct for events, it can be problematic for snapshot queries. In hindsight, osquery should have had a single flag with a bandwidth target to reach, with osquery deciding automatically how many "lines" to fit in the batch.
This might be something to open an issue about: if you have high-traffic machines sending a lot of events and a good number of snapshot queries, you have to keep that max_lines flag high to avoid backlogging the DB with event log lines, but at the same time, if you have network issues, osquery will start sending more and more snapshot lines, which are quite a bit bigger.
And I don't have a good solution right now; I realized this a bit late in the discussion.