# core
v
Hi Team. I would like feedback on a potential backoff feature/switch for the tls_logger plugin. It could also be applied to the AWS logger plugin.

Issue: When the TLS logging endpoint is down or having issues, I would like osquery to automatically back off from sending more logs.

Proposed solution:
• Add a `--logger_tls_backoff=true` switch.
• With the above switch, assuming `--logger_tls_period=3` and unsuccessful requests, the next request will happen in 3^1 = 3 minutes, the one after that in 3^2 = 9 minutes, the next in 3^3 = 27 minutes, and so forth until a fixed maximum.
• The fixed maximum will be 3 hours, but this is up for discussion. The maximum could also be a switch.

Current workaround: I can adjust the TLS logger switches manually to reduce the frequency/size of the logging requests. However, I would like a built-in solution.
s
The backoff makes sense to me (nit: the current period is in seconds, not minutes, but the logic is clearly the same). I wonder if we should also introduce a way to dismiss whatever is left of the backoff time. I can envision a situation where the service goes down due to increased activity, but caused by something else going on on the servers. Then it comes back up, but without the full availability it would need to handle the osquery logger activity it had earlier; with a high backoff time, it might have enough time to recover (and for the other work to finish), avoiding going down again. At the same time, I can see the service coming back up and the end user expecting logs to arrive almost immediately, not with hours of delay.
I'm also trying to work out if and how this relates to something I mentioned in a past office hours. If the server goes down due to high activity, the user might be surprised by the volume of data coming in when it comes back up, since osquery has accumulated more logs to send each period than it would normally send when everything worked fine. And not only due to that, but due to the limited knobs we have to limit the flow.
v
Yes, this is directly related to that. Many logging integrations have bandwidth limits. If the user wants to force-restart the logs, maybe they can set `--logger_tls_backoff=false`?
s
Changing the flag makes sense, but I see several ways to implement it (and if it continuously sleeps for 3 hours, you need something to wake it up). If instead there's still an internal poll that just skips sending until the delay has passed, then it works.
šŸ‘ 1
I mean, I did not make an assumption on how you were going to implement it exactly 🙂
v