# kolide
d
Hi. We've rolled out a POC for Fleet and osquery and I'm trying to understand the offline timeout for our clients, as they seem to be dropping offline very quickly. I was given this code as the Offline (MIA) duration, which would seem to be a decently long amount of time - https://github.com/kolide/fleet/blob/7494513400b1d15d3e770358350d227ffbe2e4ce/server/kolide/hosts.go#L33. Is there a list of client events that would trigger an online status? I'm assuming config_refresh, a check-in to look for new distributed queries (distributed_interval), and probably several other client actions should be flagging them as valid.
z
Just caught your other thread. Copying over my reply.
The linked code is the duration for MIA (hosts that have not been seen for 30 days). The online status is calculated for each host based on the observed intervals set for config_refresh, logger_tls_period, and distributed_interval. IIRC we give some grace period over the "expected" interval.
Do you by chance have any of those intervals set to something lower than 10 seconds?
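For readers following along, here is a rough sketch of the kind of check described above: a host counts as online if it has been seen within its expected check-in interval plus a grace period, and as MIA after 30 days of silence. This is not Fleet's actual code (Fleet is written in Go). The function name, the use of the shortest of the three intervals, and the 30-second grace value are all assumptions for illustration only.

```python
from datetime import datetime, timedelta

# From the thread: hosts not seen for 30 days are considered MIA.
MIA_DURATION = timedelta(days=30)
# Hypothetical grace period; the thread only says "some grace period".
GRACE_PERIOD = timedelta(seconds=30)


def host_status(last_seen, now, config_refresh, logger_tls_period, distributed_interval):
    """Classify a host as "online", "offline", or "mia".

    Interval arguments are in seconds, mirroring the osquery options of the
    same names. Assumption: the shortest interval is the expected check-in
    cadence.
    """
    expected = timedelta(
        seconds=min(config_refresh, logger_tls_period, distributed_interval)
    )
    silence = now - last_seen
    if silence > MIA_DURATION:
        return "mia"
    if silence > expected + GRACE_PERIOD:
        return "offline"
    return "online"


now = datetime(2020, 1, 1, 12, 0, 0)
two_minutes_ago = now - timedelta(minutes=2)

# With 10-second intervals, two minutes of silence already exceeds the window.
short_intervals = host_status(two_minutes_ago, now, 10, 10, 10)
# With every interval at 3600 seconds, the same silence is well within the window.
long_intervals = host_status(two_minutes_ago, now, 3600, 3600, 3600)
print(short_intervals, long_intervals)
```

Under these assumptions, a host with hour-long intervals should stay online for roughly an hour after its last check-in, which is why hosts dropping offline quickly with 3600-second settings would point to some other problem.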
d
thanks. Yes we do. We felt 10 seconds was pretty aggressive for config_refresh and distributed_interval
they are both currently set to 3600, though that was before we really understood what the latter meant and we planned to set it to 60
z
How quickly do they go offline? Fleet should not set them to offline for quite some time if that's your interval.
d
are you recommending we keep that at 10? we noticed a lot of open connections and file descriptors to redis and thought the frequency of these check-ins might be too much
most of these had just registered in the last couple of days
so it seems to me that something else is amiss if they aren't showing up online after just registering
z
Are the osquery processes still running on them?
10s is a pretty typical distributed interval until you get to tens of thousands of hosts.
d
it is on mine. 🙂, but I've got a request out to verify that since I don't have access
z
It is on yours and yours shows as offline?
d
that's another good question. 🙂. the fleet servers were apparently just taken offline as I was asking this question so I'll have to wait to access the UI again. I think the spirit of my question has been answered though - something else appears to be wrong. We'll do more digging here once the systems are back up.
z
Something else wrong would be my top guess, followed by a bug in the Fleet code (much less likely in my judgement). Let us know what you find!
d
Definitely, thanks for the feedback. One follow up - I can see why you'd want to be relatively aggressive on distributed_interval, but less so on config_refresh. We were planning on actually reducing that one to maybe every 12 hours. Do you see the osquery / fleet community refreshing the config frequently?
z
Yeah, no need for it to be very short for most folks. Shorter means it's quicker to see changes roll out which is nice.
d
sure, makes sense. We are reasoning that most changes would be to distributed queries. I guess we'll find out as we progress. 🙂