# core
t
Hi @Julian Scala, I sort of understand the behavior you are asking for, but I want to clarify. What do you want the agents to do exactly in the event a logger plugin is not responding?
j
Thanks for the response! I want them to do nothing haha, specifically to NOT store/cache results until the logger plugin is back online/consuming.
I mean, discard results from scheduled queries until the logger plugin responds, I guess.
t
I see. I hope you don’t mind me saying, but that is quite the opposite of what people normally want/expect.
j
Haha no worries, but just to give a little bit of context: we don’t want the devices to store information that can’t be sent (unless there is a limit that can be set). At the same time, we don’t want our backend services to get swamped with records once they come back online. 😄
t
You may be able to accomplish this today with the buffered logger options. The buffered logger is sort of the “base class” for multiple remote logging plugins.
j
Is the --buffered_log_max=10 flag used for this? My understanding is that 10 logs is the maximum count the device/agent can hold, and if more logs are buffered the older ones are removed. Is that the way it works? We have this set to 0, meaning there is no limit on this buffer and everything is kept. Please correct me if I am wrong.
z
Yes. IIRC that buffer is only cleared after a successful or failed log attempt so setting it to something like 1 might effectively do what you want?
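As a concrete illustration, here is a minimal flagfile sketch of the setup being discussed, assuming the aws_kinesis logger plugin; the stream name and region are placeholders, and buffered_log_max=1 reflects the suggestion above:

# osquery flagfile sketch (assumed: aws_kinesis logger; placeholder stream/region)
--logger_plugin=aws_kinesis
# Placeholder Kinesis stream name and region
--aws_kinesis_stream=device-results
--aws_region=us-east-1
# Keep at most one buffered log entry; older entries are purged around each
# log attempt, so results are effectively discarded while the plugin is down
--buffered_log_max=1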
j
I think it is! This is amazing, thanks a lot for your help!🚀
s
Can you say more about what leads you to this use case? I’m not sure I’ve encountered it before, so I’d love to hear about what’s behind it
j
Yes! We output logs of a huge device fleet to an AWS Kinesis data stream. This past week, there was an outage on AWS Kinesis, causing the streams not to respond or receive any records. Every device we manage stored every result for more than 24 hours (we have a lot of snapshot queries in short intervals). As you can imagine, by the time Kinesis was up again, every device sort of ‘puked’ every record it had. Thankfully we have a really good backend service processing those results, but it got really smashed. Not to mention devices losing HD space by accumulating every result log.
We want to avoid this kind of situation again; just discard the data.
s
Okay! I kinda get that, but I’d probably come at it a bit differently… This is a common problem in modern microservices. One method is some kind of rate limiting or circuit breaker to avoid a large ingest melting things. “Backpressure” is another approach. Though I’d generally expect Kinesis to be able to handle anything you throw at it.
If you’re willing to just toss aside this data, how much value does collecting it have?
j
Never thought Kinesis could ever be down, but it happened. Things didn’t melt, it only cost more money for a bit of time. We don’t collect the data, we just process records in order to have the current state of devices. We toss aside past data since we don’t really care about historic changes. Just the current state.
s
Ah. That makes sense. If you’re just using it as a “current state” sort of thing, there’s little value in the past
z
@Julian Scala I wonder if it might be as/more effective to run a TLS server that issues the same queries as live queries. Sounds like you're not using the main benefits of scheduled queries (differentials and offline results).
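For context, a rough sketch of that alternative as osquery flags, assuming a TLS distributed-query server; the hostname and endpoint paths below are placeholders, not any specific product’s API:

# osquery flagfile sketch for live (distributed) queries over TLS
# (placeholder hostname and endpoints)
--disable_distributed=false
--distributed_plugin=tls
--tls_hostname=fleet.example.com
--distributed_tls_read_endpoint=/api/v1/distributed/read
--distributed_tls_write_endpoint=/api/v1/distributed/write
# How often (in seconds) agents check in for pending live queries
--distributed_interval=60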
j
Ah, good point! We used Kolide with TLS loggers, and at some point things started to get expensive with our implementation. Maybe due to some poor configuration on both server and agents. Kinesis was a good call: tripled the amount of results for less than half the cost. Now I wonder how effective it could be to output live query results to a data stream, but I understand that is not supported by the official plugins. Still, we have a lot of stuff to try out; our pipeline keeps evolving!