Hi everyone - great to meet you! I was wondering if I could ask for help with a specific question and a general question.
I've been running an instance of Fleet with about 40 machines running osquery 5.0.1 via an Orbit installer at 3 different remote sites. Yesterday around 12:15, they all stopped calling home. The Orbit installers I used still work - if I run the exact installer used at these sites on a test machine, the test host starts calling home. These hosts dropping offline in Fleet coincided with a drop-off in network traffic at the application gateway in front of our Fleet server. I've confirmed out of band that the machines at these sites are alive and healthy - but for the next few days, I'm not able to access them.
One suspicion I have is an orbit-driven update... is there any way to identify any recently published updates on the public orbit update server? Would anyone happen to have any troubleshooting ideas while I wait for access to these sites? Thanks!
12/23/2021, 12:37 AM
Hi John, sorry to hear this happened. We did push an Orbit update yesterday that seemed to cause problems on some hosts (mainly Windows in our observations). Were your hosts Windows? Please bear with us as we work out the kinks to get Orbit to a stable release.
12/23/2021, 12:38 AM
Thanks! They were Windows hosts. Appreciate the info and all the work you do! Is this tracked by issue #3456?
12/23/2021, 12:41 AM
That one should only affect hosts that were not previously enrolled.
12/23/2021, 12:46 AM
Oh got it, thanks. Would there happen to be an issue tracking the problem you mentioned? (so I don't have to pepper you with questions here 🙂)
12/23/2021, 12:47 AM
I need to file one. Can you please check your hosts now -- are they still offline? I just pushed an update that may have fixed the issue (it did on the VM where I was able to reproduce).
Ok, not at the moment - but I'm not sure what the internal update timings are in Orbit
12/23/2021, 12:53 AM
Hmmm, okay. It should be pretty quick if it does work. If you are able to log onto one of those hosts we can do some more debugging.
12/23/2021, 12:58 AM
I greatly appreciate the offer - sadly at the moment I don't have access to these hosts until after the holidays. I'm happy to help debug if you happen to have a local repro, but for the moment, the only machine that's NOT Offline is the machine I tested our installers on - that machine has remained healthy for the last hour
12/23/2021, 1:01 AM
Yesterday at 12:15 -- in what timezone was that?
12/23/2021, 1:02 AM
Central - they dropped at ~18:15 UTC
(I'm relying on the "Last Seen" mouse-over text in the Fleet hosts UI - which seems to reflect the more rapid TLS checkins? Rather than the "Updated At" timestamp in the host-details UI)
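For reference, the Central-to-UTC conversion mentioned above can be checked with a short Python sketch. The calendar date (2021-12-22) is an assumption, since the thread only says "yesterday", and the fixed UTC-6 offset assumes Central Standard Time, which applies in December.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Central Standard Time (UTC-6), which is in effect in December.
CST = timezone(timedelta(hours=-6), "CST")


def central_standard_to_utc(local: datetime) -> datetime:
    """Interpret a naive datetime as CST and convert it to UTC."""
    return local.replace(tzinfo=CST).astimezone(timezone.utc)


# Assumption: the date is illustrative; only the 12:15 time comes from the thread.
dropped_local = datetime(2021, 12, 22, 12, 15)
dropped_utc = central_standard_to_utc(dropped_local)
print(dropped_utc.strftime("%H:%M UTC"))  # prints "18:15 UTC"
```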
12/23/2021, 1:06 AM
That makes sense. By default the "updated at" is only done every hour.
I'm trying to find the exact time we pushed the update yesterday.
12/23/2021, 1:07 AM
Thanks so much
That "Last seen" in the UI DOES seem inconsistent with our load balancer metrics, which show traffic drop-off at 2:15AM UTC
(Stepping away for ~30 mins - thanks again for the help)
12/23/2021, 1:19 AM
Confirmed 2:15AM UTC was the update time 😕
12/23/2021, 1:58 AM
😞 well at a minimum it's good to have root cause! And re-installing agents wouldn't be the end of the world for us
Sadly I don't know Go, but if there's any way I can help test/investigate fixes etc, I'm happy to help
12/23/2021, 3:00 AM
When you get access to one of those machines, can you please go into "Services", right-click the Orbit osquery/Fleet osquery service and open "Properties", then copy the command line and run it in an admin PowerShell? The logs that generates can help us understand what's going on.
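As a rough alternative to copying the command line out of the Properties dialog, the sketch below reads it from the output of the Windows `sc qc` command, which prints a service's configuration including a `BINARY_PATH_NAME` line. The helper names are hypothetical, and the service's internal name is not given in this thread, so it has to be supplied by whoever runs this.

```python
import re
import subprocess
from typing import Optional


def parse_binary_path(sc_qc_output: str) -> Optional[str]:
    """Return the BINARY_PATH_NAME value from `sc qc` output, or None if absent."""
    match = re.search(
        r"^\s*BINARY_PATH_NAME\s*:\s*(.+?)\s*$", sc_qc_output, re.MULTILINE
    )
    return match.group(1) if match else None


def service_command_line(service_name: str) -> Optional[str]:
    """Run `sc qc <service_name>` (Windows only) and extract the command line."""
    result = subprocess.run(
        ["sc", "qc", service_name], capture_output=True, text=True, check=True
    )
    return parse_binary_path(result.stdout)
```

Calling `service_command_line("<service name>")` on one of the affected hosts should return the same string shown in the Properties dialog, which can then be pasted into an admin PowerShell to capture the logs.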