#fleet
John Kornfeld

12/23/2021, 12:33 AM
Hi everyone - great to meet you! I was wondering if I could ask for help with a specific question and a general question. I've been running an instance of Fleet with about 40 machines running osquery 5.0.1 via an Orbit installer at 3 different remote sites. Yesterday around 12:15, they all stopped calling home. The Orbit installers I used still work - if I run the exact installer used at these sites on a test machine, the test host starts calling home. These hosts dropping offline in Fleet coincided with a drop-off in network traffic at the application gateway in front of our Fleet server. I've confirmed out of band that the machines at these sites are alive and healthy - but for the next few days, I'm not able to access them. One suspicion I have is an Orbit-driven update... is there any way to identify any recently published updates on the public Orbit update server? Would anyone happen to have any troubleshooting ideas while I wait for access to these sites? Thanks!
zwass

12/23/2021, 12:37 AM
Hi John, sorry to hear this happened. We did push an Orbit update yesterday that seemed to cause problems on some hosts (mainly Windows in our observations). Were yours Windows hosts? Please bear with us as we work out the kinks to get Orbit to a stable release.
John Kornfeld

12/23/2021, 12:38 AM
Thanks! They were Windows hosts. Appreciate the info and all the work you do! Is this tracked by issue #3456?
zwass

12/23/2021, 12:41 AM
That one should only affect hosts that were not previously enrolled.
John Kornfeld

12/23/2021, 12:46 AM
Oh got it, thanks. Would there happen to be an issue tracking the problem you mentioned? (so I don't have to pepper you with questions here 🙂 )
zwass

12/23/2021, 12:47 AM
I need to file one. Can you please check your hosts now -- are they still offline? I just pushed an update that may have fixed the issue (it did on the VM where I was able to reproduce).
John Kornfeld

12/23/2021, 12:48 AM
Oh nice - let me check...
12:49 AM
Ok, not at the moment - but I'm not sure what the internal update timings are in Orbit
zwass

12/23/2021, 12:53 AM
Hmmm, okay. It should be pretty quick if it does work. If you are able to log onto one of those hosts we can do some more debugging.
John Kornfeld

12/23/2021, 12:58 AM
I greatly appreciate the offer - sadly I don't have access to these hosts until after the holidays. I'm happy to help debug if you happen to have a local repro, but for the moment, the only machine that's NOT offline is the machine I tested our installers on - that machine has remained healthy for the last hour.
zwass

12/23/2021, 1:01 AM
Yesterday at 12:15 -- in what timezone was that?
John Kornfeld

12/23/2021, 1:02 AM
Central - they dropped at ~18:15 UTC
1:04 AM
(I'm relying on the "Last Seen" mouse-over text in the Fleet hosts UI - which seems to reflect the more rapid TLS checkins? Rather than the "Updated At" timestamp in the host-details UI)
zwass

12/23/2021, 1:06 AM
That makes sense. By default the "updated at" is only done every hour.
1:06 AM
I'm trying to find the exact time we pushed the update yesterday.
John Kornfeld

12/23/2021, 1:07 AM
Thanks so much
1:08 AM
That "Last Seen" in the UI DOES seem inconsistent with our load balancer metrics, which show traffic drop-off at 2:15AM UTC
1:09 AM
(Stepping away for ~30 mins - thanks again for the help)
zwass

12/23/2021, 1:19 AM
Confirmed 2:15AM UTC was the update time 😕
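
For reference, the two drop-off times quoted in this thread can be compared with a short Python sketch. It is illustrative only: it assumes Central Standard Time (UTC-6) applies in late December, and it assumes both readings fall on 12/22/2021, since the thread does not state the calendar date of the 2:15 AM UTC reading. The helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: US Central time in late December is CST, a fixed UTC-6 offset.
CST = timezone(timedelta(hours=-6), "CST")


def to_utc(moment: datetime) -> datetime:
    """Convert a timezone-aware datetime to UTC."""
    return moment.astimezone(timezone.utc)


def to_central(moment: datetime) -> datetime:
    """Convert a timezone-aware datetime to Central Standard Time."""
    return moment.astimezone(CST)


# "Last Seen" drop-off as read from the Fleet hosts UI: 12:15 Central.
last_seen_central = datetime(2021, 12, 22, 12, 15, tzinfo=CST)
last_seen_utc = to_utc(last_seen_central)

# Load balancer drop-off / update push: 2:15 AM UTC.
# Assumption: same calendar date (UTC) as the reading above.
update_utc = datetime(2021, 12, 22, 2, 15, tzinfo=timezone.utc)
update_central = to_central(update_utc)

# Gap between the two readings under the assumptions above.
gap = last_seen_utc - update_utc

print("Last Seen (UTC):    ", last_seen_utc.isoformat())
print("Update push (CST):  ", update_central.isoformat())
print("Gap between readings:", gap)
```

Under these assumptions the two readings do not line up, which matches the inconsistency noted above between the "Last Seen" text and the load balancer metrics.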
John Kornfeld

12/23/2021, 1:58 AM
😞 well, at a minimum it's good to have a root cause! And re-installing agents wouldn't be the end of the world for us
1:59 AM
Sadly I don't know Go, but if there's any way I can help test/investigate fixes etc, I'm happy to help
zwass

12/23/2021, 3:00 AM
When you get access to one of those machines, can you please go into "Services", right-click and open "Properties" for the Orbit osquery/Fleet osquery service, then copy the command line and run it in an admin PowerShell? The logs that generates can help us understand what's going on.
3:01 AM
Windows Services just swallows up all of the logs generated by services, so we still need to create a custom logging mechanism for Windows.
John Kornfeld

12/23/2021, 3:44 AM
Yup, happy to - might not be until after 12/26+, but I can share it when I'm able
zwass

01/08/2022, 5:21 PM
Were you able to get another look at these? Would be curious to know what you found.