#fleet

Francisco Huerta

04/22/2021, 7:07 PM
Hi, guys. Seeking your help again: is there any way we can instruct the agents from Fleet to purge the RocksDB contents? Continuing with our tests, we see a number of agents that have become unstable, and we suspect it might be related to their internal database. Is there any way we could handle this cleanup centrally? Any advice or similar experiences?
Noah Talerman

04/22/2021, 7:16 PM
Hi @Francisco Huerta. What do you mean when you say the agents “become unstable”? Are these agents failing to communicate with Fleet?
Francisco Huerta

04/22/2021, 7:27 PM
Thanks, @Noah Talerman. We see errors from those agents trying to establish a connection to send data to Fleet, for example
Noah Talerman

04/22/2021, 7:35 PM
Got it. Do you mind sharing the errors you’re seeing for those agents in this thread?
Francisco Huerta

04/22/2021, 7:50 PM
there you go
Noah Talerman

04/22/2021, 9:24 PM
Thank you! I don’t have an immediate answer to your question on purging RocksDB via Fleet. Working on getting an answer for you now.
zwass

04/22/2021, 9:33 PM
If you want to fully purge the RocksDB database you'd need to delete the database directory on the host. Can you run --verbose --tls_dump on one of those hosts experiencing the issue and see if you can tell what they are sending?
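For anyone scripting that manual purge, here's a minimal sketch in Python, assuming the daemon is stopped first and osquery's default database_path of /var/osquery/osquery.db on Linux/macOS (the purge_rocksdb helper name is illustrative, not part of osquery or Fleet):

```python
import shutil
from pathlib import Path

# osquery's default --database_path on Linux/macOS; adjust to your deployment.
DEFAULT_DB_PATH = "/var/osquery/osquery.db"

def purge_rocksdb(db_path: str = DEFAULT_DB_PATH) -> bool:
    """Delete the on-disk RocksDB directory so osqueryd rebuilds it on restart.

    Run only while osqueryd is stopped, or the daemon may hold stale handles.
    """
    path = Path(db_path)
    if not path.is_dir():
        return False  # nothing to purge
    shutil.rmtree(path)
    return True
```

Note this throws away everything in the database (event state, buffered logs, distributed query state), not just the buffered results.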
Francisco Huerta

04/22/2021, 9:59 PM
thanks both. yep, we're familiar with purging the contents of the directory, but we were looking for something automatic and/or centralized
thanks @zwass @Noah Talerman
zwass

04/22/2021, 10:15 PM
This seems like a feature that could be useful to add to Orbit
Dan Achin

04/22/2021, 11:06 PM
For all intents and purposes, wouldn't setting buffered_log_max=1 purge the local DBs? We dropped that setting way down when dealing with an issue where all POSTs from many of our clients ended up with HTTP 400 errors through our nginx layer in front of Fleet. If you control your osquery options at Fleet, this is a quick way to get all of the local data to expire out. I've often wondered if there was a way to read the local RocksDB, or maybe a utility to check the integrity of the data?
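If osquery options are managed through Fleet, the change Dan describes could be pushed centrally via the agent options. A sketch, assuming a fleetctl-style config spec (the exact nesting may differ across Fleet versions):

```yaml
apiVersion: v1
kind: config
spec:
  agent_options:
    config:
      options:
        buffered_log_max: 1
```

Applied with fleetctl apply -f, this should propagate to enrolled agents at their next config refresh.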
Juan Alvarez

04/22/2021, 11:07 PM
I tried that parameter but that did not work either
you can use sst_dump to see what is in the DB
Dan Achin

04/22/2021, 11:16 PM
thanks, will look at sst_dump
zwass

04/22/2021, 11:53 PM
You can also use osqueryd --database_dump. buffered_log_max will clear the buffered TLS logs, but there are other things stored in the DB
defensivedepth

04/23/2021, 11:55 AM
Is this a known osquery issue? I don't recall ever seeing that error
Francisco Huerta

04/23/2021, 12:18 PM
Ok, so it seems we've got some stuff to look into: we've discovered that some of the hosts we are monitoring are particularly chatty when it comes to generating Windows events. A step-by-step description would be as follows:
1. Those servers have packs configured with queries that collect all Windows events every 60s (incremental).
2. Apparently, the number of events collected exceeds the current events limit per TLS connection (1024), causing a local buffering effect in the osquery agent.
3. Servers that keep up this pace of event generation (>1024 per minute) continue queuing events over and over, resulting in an ever-growing buffer.
4. Fleet servers, on their side, experience TLS errors but also unexpected JSON EOFs, as shown above.
The result is that 'unstable' condition I mentioned at the beginning: a considerable amount of data stored locally that never gets purged, plus some other side effects (e.g., an increase in I/O operations on the hosts' disks).
This is the theory we are exploring, and of course the first thing to look at is tweaking the TLS events limit, together with a lower interval for the event packs (e.g., every 30s or even less).
Any other opinions or experiences are much appreciated. I'll keep everyone posted on the progress here in case this helps anyone else.
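As a rough sanity check on steps 2 and 3, here is a toy model of how the backlog grows when generation outpaces the per-connection limit (the 1024 figure comes from the thread; the rates and the backlog_after name are illustrative, not measurements):

```python
def backlog_after(minutes: int, events_per_min: int,
                  max_per_request: int = 1024,
                  requests_per_min: int = 1) -> int:
    """Buffered events remaining after `minutes`, with a capped drain per request."""
    backlog = 0
    for _ in range(minutes):
        backlog += events_per_min                                       # newly queued
        backlog = max(0, backlog - max_per_request * requests_per_min)  # shipped out
    return backlog

# A host generating 2000 events/min against one 1024-event request per minute
# accumulates 976 events of backlog every minute, without bound:
print(backlog_after(10, 2000))  # 9760
print(backlog_after(10, 1000))  # 0 (drain keeps up)
```

In this model the fix is raising the effective drain rate, whether through a larger per-request limit or more frequent shipping, which matches the tweaks being considered above.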
Dan Achin

04/23/2021, 5:22 PM
interesting
this is somewhat similar to what we saw a few weeks ago. I just checked our logs and see we used to get those EOF messages as well. We didn't focus on those particular errors as there were only a few hundred, but they do seem to have stopped once we did a mass expiration of locally buffered data by dropping the buffered_log_max
Our issue manifested itself as massive bandwidth consumption from clients as they got stuck trying to send buffered log data to Fleet but were unable to for long periods of time. A lot of detail about it is here in case anyone wants to read it: https://github.com/osquery/osquery/issues/7021
Francisco Huerta

04/23/2021, 7:11 PM
thanks, @Dan Achin, good insights too