# fleet
f
Hi, guys. Seeking your help again: is there any way we can instruct the agents from Fleet to purge the RocksDB contents? Continuing with our tests, we see a number of agents that have become unstable, and we suspect it might be related to their internal database. Is there any way we could handle this cleanup centrally? Any advice or similar experiences?
n
Hi @Francisco Huerta. What do you mean when you use the phrase “become unstable?” Are these agents failing to communicate with Fleet?
f
Thanks, @Noah Talerman. We see errors from those agents trying to establish a connection to send data to Fleet, for example
n
Got it. Do you mind sharing the errors you’re seeing for those agents in this thread?
f
👍 1
there you go
n
Thank you! I don’t have an immediate answer to your question on purging RocksDB via Fleet. Working on getting an answer for you now.
z
If you want to fully purge the RocksDB database you'd need to delete the database directory on the host. Can you run `--verbose --tls_dump` on one of those hosts experiencing the issue and see if you can tell what they are sending?
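For anyone landing here later, a rough sketch of the per-host manual cleanup and debugging described above. The paths and service name are assumptions for a default Linux install; adjust for your platform and packaging.

```sh
# 1. Purge the local RocksDB store (osquery recreates it on next start).
sudo systemctl stop osqueryd
sudo rm -rf /var/osquery/osquery.db   # assumed default database_path
sudo systemctl start osqueryd

# 2. Or, with the service stopped, run the agent in the foreground to see
#    exactly what it is sending to Fleet over TLS.
sudo systemctl stop osqueryd
sudo osqueryd --flagfile /etc/osquery/osquery.flags --verbose --tls_dump
```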
f
thanks both. yep, we're familiar with purging the contents of the directory, but we were looking for something automatic and/or centralized
thanks @zwass @Noah Talerman
z
This seems like a feature that could be useful to add to Orbit
👍 2
d
For all intents and purposes, wouldn't setting `buffered_log_max=1` purge the local DBs? We dropped that setting way down when dealing with an issue where all POSTs from many of our clients ended up with HTTP 400 errors through our nginx layer in front of Fleet. If you control your osquery options at Fleet, this is a quick way to get all of the local data to expire out. I've often wondered if there was a way to read the local RocksDB, or maybe a utility to check the integrity of the data?
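For reference, a minimal sketch of what that looks like, assuming the option is pushed either via a local flagfile or through Fleet-managed agent options; `buffered_log_max` is a standard osquery flag, but the value and the flagfile path here are illustrative assumptions.

```sh
# Shrink the buffered-log ceiling so the local RocksDB backlog expires out.
osqueryd --flagfile /etc/osquery/osquery.flags --buffered_log_max=1

# If options are managed centrally, the equivalent key would go under the
# osquery options section of Fleet's agent options instead of a flagfile.
```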
j
I tried that parameter but that did not work either
you can use `sst_dump` to see what is in the DB
d
thanks, will look at `sst_dump`
z
`osqueryd --database_dump`
`buffered_log_max` will clear the buffered TLS logs, but there are other things stored in the DB
👍 1
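For anyone following along, a hedged sketch of both inspection routes mentioned above. The database path is the assumed Linux default, and `sst_dump` comes from the RocksDB tools rather than from osquery itself.

```sh
# Dump the osquery database contents (stop the service first so nothing else
# holds the RocksDB lock).
sudo systemctl stop osqueryd
sudo osqueryd --database_path /var/osquery/osquery.db --database_dump

# Scan the raw SST files with RocksDB's sst_dump utility.
sst_dump --file=/var/osquery/osquery.db --command=scan
```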
d
Is this a known osquery issue? I don't recall ever seeing that error
f
Ok, so it seems we've got some stuff to look into: we've discovered that some of the hosts we are monitoring are particularly chatty when it comes to generating Windows events. A step-by-step description would be as follows:
1. Those servers have packs configured with queries that collect all Windows events every 60s (incremental).
2. Apparently, the number of events collected exceeds the current events limit per TLS connection (1024), causing a local buffering effect in the osquery agent.
3. Those servers keep up that pace of event generation (> 1024 per minute) and continue queuing events over and over, resulting in an ever-growing buffer.
4. Fleet servers, on their side, experience some TLS errors but also unexpected JSON EOFs, as shown above.
The result is the 'unstable' condition I mentioned at the beginning: a considerable amount of data stored locally that never gets purged, plus some other side effects (e.g., an increase in I/O operations on the hosts' disks).
All of this is the theory we are exploring, and of course the first thing to look at is tweaking the TLS events limit, together with a lower interval for the events packs (e.g., every 30s or even less); a sketch of the relevant flags is below.
Any other opinions or experiences are much appreciated. I'll keep everyone posted on the progress here in case this helps anyone else.
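A rough sketch of the knobs that theory points at, assuming the per-request cap mentioned above is osquery's `logger_tls_max_lines` (default 1024); the values are illustrative, not recommendations.

```sh
# Raise the per-POST log line cap and flush more often so buffered events
# drain faster than they are generated; also keep a ceiling on the backlog.
osqueryd \
  --flagfile /etc/osquery/osquery.flags \
  --logger_tls_max_lines=4096 \
  --logger_tls_period=10 \
  --buffered_log_max=500000
# The query interval itself (e.g. dropping the pack from 60s to 30s) is set
# in the pack/schedule definition in Fleet, not via a flag.
```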
d
interesting
this is somewhat similar to what we saw a few weeks ago, and I just checked our logs and see we used to get those EOF messages as well. we didn't focus on those particular errors as there were only a few hundred, but they do seem to have stopped once we did a mass expiration of locally buffered data by dropping `buffered_log_max`
Our issue manifested itself with massive bandwidth consumption from clients as they got stuck trying to send buffered log data to Fleet but were unable to for long periods of time. A lot of stuff about it is here in case anyone wants to read it: https://github.com/osquery/osquery/issues/7021
f
thanks, @Dan Achin, good insights too
👍 1