# core
t
Hi @Julian Scala, I sort of understand the behavior you are asking for, but I want to clarify. What do you want the agents to do exactly in the event a logger plugin is not responding?
j
Thanks for the response! I want them to do nothing haha, specifically to NOT store/cache results until the logger plugin is back online/consuming.
I mean, discard results from scheduled queries until the logger plugin responds, I guess.
t
I see. I hope you don’t mind me saying, but that is quite the opposite of what people normally want/expect.
j
Haha no worries, but just to give a little bit of context: we don’t want the devices to store information that can’t be sent (unless there is a limit that can be set). At the same time, we don’t want our backend services to get swamped with records once they come back online. 😄
t
You may be able to accomplish this today with the buffered logger options. The buffered logger is sort of the “base class” for multiple remote logging plugins.
j
Is the --buffered_log_max=10 flag used for this? My understanding is that 10 logs is the maximum count the device/agent can hold, and if more logs are buffered the older ones are removed. Is that the way it works? We have this set to 0, meaning there is no limit on this buffer and everything is kept. Please correct me if I am wrong.
z
Yes. IIRC that buffer is only cleared after a successful or failed log attempt so setting it to something like 1 might effectively do what you want?
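As a concrete illustration, here is a minimal flagfile sketch of the setup being discussed, assuming the aws_kinesis logger plugin; the stream name and region are placeholders, and buffered_log_max=1 reflects the suggestion above:

# osquery flagfile sketch (assumed: aws_kinesis logger; placeholder stream/region)
--logger_plugin=aws_kinesis
# Placeholder Kinesis stream name and region
--aws_kinesis_stream=device-results
--aws_region=us-east-1
# Keep at most one buffered log entry; older entries are purged around each
# log attempt, so results are effectively discarded while the plugin is down
--buffered_log_max=1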
j
I think it is! This is amazing, thanks a lot for your help!🚀
s
Can you say more about what leads you to this use case? I’m not sure I’ve encountered it before, so I’d love to hear about what’s behind it
j
Yes! We output logs of a huge device fleet to an AWS Kinesis data stream. This past week, there was an outage on AWS Kinesis, causing the streams not to respond or receive any records. Every device we manage stored every result for more than 24 hours (we have a lot of snapshot queries in short intervals). As you can imagine, by the time Kinesis was up again, every device sort of ‘puked’ every record it had. Thankfully we have a really good backend service processing those results, but it got really smashed. Not to mention devices losing HD space by accumulating every result log.
We want to avoid this kind of situation again; just discard the data.
s
Okay! I kinda get that, but I’d probably come at it a bit differently… This is a common problem in modern microservices. One method is some kind of rate limiting or circuit breaker to avoid a large ingest melting things. “Backpressure” is another approach. Though I’d generally expect Kinesis to be able to handle anything you throw at it.
If you’re willing to just toss aside this data, how much value does collecting it have?
j
Never thought Kinesis could ever be down, but it happened. Things didn’t melt, it only cost more money for a bit of time. We don’t collect the data, we just process records in order to have the current state of devices. We toss aside past data since we don’t really care about historic changes. Just the current state.
s
Ah. That makes sense. If you’re just using it as a “current state” sort of thing, there’s little value in the past
z
@Julian Scala I wonder if it might be as/more effective to run a TLS server that issues the same queries as live queries. Sounds like you're not using the main benefits of scheduled queries (differentials and offline results).
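For context, a rough sketch of that alternative as osquery flags, assuming a TLS distributed-query server; the hostname and endpoint paths below are placeholders, not any specific product’s API:

# osquery flagfile sketch for live (distributed) queries over TLS
# (placeholder hostname and endpoints)
--disable_distributed=false
--distributed_plugin=tls
--tls_hostname=fleet.example.com
--distributed_tls_read_endpoint=/api/v1/distributed/read
--distributed_tls_write_endpoint=/api/v1/distributed/write
# How often (in seconds) agents check in for pending live queries
--distributed_interval=60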
j
Ah, good point! We used Kolide with TLS loggers, and at some point things started to get expensive with our implementation. Maybe due to some poor configuration on both server and agents. Kinesis was a good call: tripled the amount of results for less than half the cost. Now I wonder how effective it could be to output live query results to a data stream, but I understand that is not supported by the official plugins. Still, we have a lot of stuff to try out; our pipeline keeps evolving!