#fleet

Francisco Huerta

04/22/2021, 7:07 PM
Hi, guys. Seeking your help again: is there any way we can instruct the agents from Fleet to purge the RocksDB contents? Continuing with our tests, we see a number of agents that have become unstable, and we suspect it might be related to their internal database. Is there any way we could handle this cleanup centrally? Any advice or similar experiences?
Noah Talerman

04/22/2021, 7:16 PM
Hi @Francisco Huerta. What do you mean when you say the agents “become unstable”? Are these agents failing to communicate with Fleet?
Francisco Huerta

04/22/2021, 7:27 PM
Thanks, @Noah Talerman. We see errors from those agents trying to establish a connection to send data to Fleet, for example
Noah Talerman

04/22/2021, 7:35 PM
Got it. Do you mind sharing the errors you’re seeing for those agents in this thread?
Francisco Huerta

04/22/2021, 7:50 PM
there you go
Noah Talerman

04/22/2021, 9:24 PM
Thank you! I don’t have an immediate answer to your question on purging RocksDB via Fleet. Working on getting an answer for you now.
zwass

04/22/2021, 9:33 PM
If you want to fully purge the RocksDB database you'd need to delete the database directory on the host. Can you run --verbose --tls_dump on one of those hosts experiencing the issue and see if you can tell what they are sending?
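For anyone scripting that manual purge, here's a minimal sketch in Python, assuming the daemon is stopped first and osquery's default database_path of /var/osquery/osquery.db on Linux/macOS (the purge_rocksdb helper name is illustrative, not part of osquery or Fleet):

```python
import shutil
from pathlib import Path

# osquery's default --database_path on Linux/macOS; adjust to your deployment.
DEFAULT_DB_PATH = "/var/osquery/osquery.db"

def purge_rocksdb(db_path: str = DEFAULT_DB_PATH) -> bool:
    """Delete the on-disk RocksDB directory so osqueryd rebuilds it on restart.

    Run only while osqueryd is stopped, or the daemon may hold stale handles.
    """
    path = Path(db_path)
    if not path.is_dir():
        return False  # nothing to purge
    shutil.rmtree(path)
    return True
```

Note this throws away everything in the database (event state, buffered logs, distributed query state), not just the buffered results.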
Francisco Huerta

04/22/2021, 9:59 PM
thanks both. yep, we're familiar with purging the contents of the directory, but we were looking for something automatic and/or centralized
thanks @zwass @Noah Talerman
zwass

04/22/2021, 10:15 PM
This seems like a feature that could be useful to add to Orbit
Dan Achin

04/22/2021, 11:06 PM
For all intents and purposes, wouldn't setting buffered_log_max=1 purge the local DBs? We dropped that setting way down when dealing with an issue where all POSTs from many of our clients ended up with HTTP 400 errors through our nginx layer in front of Fleet. If you control your osquery options at Fleet, this is a quick way to get all of the local data to expire out. I've often wondered if there was a way to read the local RocksDB, or maybe a utility to check the integrity of the data?
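If osquery options are managed through Fleet, the change Dan describes could be pushed centrally via the agent options. A sketch, assuming a fleetctl-style config spec (the exact nesting may differ across Fleet versions):

```yaml
apiVersion: v1
kind: config
spec:
  agent_options:
    config:
      options:
        buffered_log_max: 1
```

Applied with fleetctl apply -f, this should propagate to enrolled agents at their next config refresh.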
Juan Alvarez

04/22/2021, 11:07 PM
I tried that parameter but that did not work either
you can use sst_dump to see what is in the DB
Dan Achin

04/22/2021, 11:16 PM
thanks, will look at sst_dump
zwass

04/22/2021, 11:53 PM
You can also use osqueryd --database_dump. buffered_log_max will clear the buffered TLS logs, but there are other things stored in the DB
defensivedepth

04/23/2021, 11:55 AM
Is this a known osquery issue? I don't recall ever seeing that error
Francisco Huerta

04/23/2021, 12:18 PM
Ok, so it seems we've got some stuff to look into: we've discovered that some of the hosts we are monitoring are particularly chatty when it comes to generating Windows events. A step-by-step description would be as follows:
1. Those servers have packs configured with queries that collect all Windows events every 60s (incremental).
2. Apparently, the number of events collected exceeds the current events limit per TLS connection (1024), causing a local buffering effect in the osquery agent.
3. Servers that keep up this pace of event generation (>1024 per minute) continue queuing events over and over, resulting in an ever-growing buffer.
4. Fleet servers, on their side, experience TLS errors but also unexpected JSON EOFs, as shown above.
The result is that 'unstable' condition I mentioned at the beginning: a considerable amount of data stored locally that never gets purged, plus some other side effects (e.g., an increase in I/O operations on the hosts' disks).
This is the theory we are exploring, and of course the first thing to look at is tweaking the TLS events limit, together with a lower interval for the event packs (e.g., every 30s or even less).
Any other opinions or experiences are much appreciated. I'll keep everyone posted on the progress here in case this helps anyone else.
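As a rough sanity check on steps 2 and 3, here is a toy model of how the backlog grows when generation outpaces the per-connection limit (the 1024 figure comes from the thread; the rates and the backlog_after name are illustrative, not measurements):

```python
def backlog_after(minutes: int, events_per_min: int,
                  max_per_request: int = 1024,
                  requests_per_min: int = 1) -> int:
    """Buffered events remaining after `minutes`, with a capped drain per request."""
    backlog = 0
    for _ in range(minutes):
        backlog += events_per_min                                       # newly queued
        backlog = max(0, backlog - max_per_request * requests_per_min)  # shipped out
    return backlog

# A host generating 2000 events/min against one 1024-event request per minute
# accumulates 976 events of backlog every minute, without bound:
print(backlog_after(10, 2000))  # 9760
print(backlog_after(10, 1000))  # 0 (drain keeps up)
```

In this model the fix is raising the effective drain rate, whether through a larger per-request limit or more frequent shipping, which matches the tweaks being considered above.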
Dan Achin

04/23/2021, 5:22 PM
interesting
this is somewhat similar to what we saw a few weeks ago. I just checked our logs and see we used to get those EOF messages as well. We didn't focus on those particular errors as there were only a few hundred, but they do seem to have stopped once we did a mass expiration of locally buffered data by dropping the buffered_log_max
Our issue manifested itself as massive bandwidth consumption from clients as they got stuck trying to send buffered log data to Fleet but were unable to for long periods of time. A lot of detail about it is here in case anyone wants to read it: https://github.com/osquery/osquery/issues/7021
Francisco Huerta

04/23/2021, 7:11 PM
thanks, @Dan Achin, good insights too