Hi everyone great to meet you I was wondering if I could ask osquery #fleet

John Kornfeld

12/23/2021, 12:33 AM

Hi everyone - great to meet you! I was wondering if I could ask for help with a specific question and a general question. I've been running an instance of Fleet with about 40 machines running osquery 5.0.1 via an Orbit installer at 3 different remote sites. Yesterday around 12:15, they all stopped calling home. The Orbit installers I used still work - if I run the exact installer used at these sites on a test machine, the test host starts calling home. These hosts dropping offline in Fleet coincided with a drop-off in network traffic at the application gateway in front of our Fleet server. I've confirmed out of band that the machines at these sites are alive and healthy - but for the next few days, I'm not able to access them. One suspicion I have is an Orbit-driven update... is there any way to identify any recently published updates on the public Orbit update server? Would anyone happen to have any troubleshooting ideas while I wait for access to these sites? Thanks!

zwass

12/23/2021, 12:37 AM

Hi John, sorry to hear this happened. We did push an Orbit update yesterday that seemed to cause problems on some hosts (mainly Windows in our observations). Were yours Windows hosts? Please bear with us as we work out the kinks to get Orbit to a stable release.

John Kornfeld

12/23/2021, 12:38 AM

Thanks! They were Windows hosts. Appreciate the info and all the work you do! Is this tracked by issue #3456?

zwass

12/23/2021, 12:41 AM

That one should only affect hosts that were not previously enrolled.

John Kornfeld

12/23/2021, 12:46 AM

Oh got it, thanks. Would there happen to be an issue tracking the problem you mentioned? (so I don't have to pepper you with questions here 🙂 )

zwass

12/23/2021, 12:47 AM

I need to file one. Can you please check your hosts now -- are they still offline? I just pushed an update that may have fixed the issue (it did on the VM where I was able to reproduce).

👍 1

John Kornfeld

12/23/2021, 12:48 AM

Oh nice - let me check...

John Kornfeld

12/23/2021, 12:49 AM

Ok, not at the moment - but I'm not sure what the internal update timings are in Orbit

zwass

12/23/2021, 12:53 AM

Hmmm, okay. It should be pretty quick if it does work. If you are able to log onto one of those hosts we can do some more debugging.

John Kornfeld

12/23/2021, 12:58 AM

I greatly appreciate the offer - sadly I don't have access to these hosts until after the holidays. I'm happy to help debug if you happen to have a local repro, but for the moment, the only machine that's NOT offline is the machine I tested our installers on - that machine has remained healthy for the last hour.

zwass

12/23/2021, 1:01 AM

Yesterday at 12:15 -- in what timezone was that?

John Kornfeld

12/23/2021, 1:02 AM

Central - they dropped at ~18:15 UTC

John Kornfeld

12/23/2021, 1:04 AM

(I'm relying on the "Last Seen" mouse-over text in the Fleet hosts UI - which seems to reflect the more rapid TLS checkins? Rather than the "Updated At" timestamp in the host-details UI)

zwass

12/23/2021, 1:06 AM

That makes sense. By default the "updated at" is only done every hour.

👍 1

zwass

12/23/2021, 1:06 AM

I'm trying to find the exact time we pushed the update yesterday.

John Kornfeld

12/23/2021, 1:07 AM

Thanks so much

John Kornfeld

12/23/2021, 1:08 AM

That "Last Seen" in the UI DOES seem inconsistent with our load balancer metrics, which show traffic drop-off at 2:15 AM UTC
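
(Editor's note: the two timestamps quoted in this thread can be put side by side with a small conversion. This is a minimal sketch, assuming "Central" means US Central Standard Time (UTC-6, which applies in late December). The calendar dates are also assumptions for illustration, since the thread only gives times of day. The helper names are hypothetical.)

```python
from datetime import datetime, timedelta, timezone

# Assumption: "Central" is US Central Standard Time, a fixed UTC-6 offset in December.
CST = timezone(timedelta(hours=-6), "CST")


def central_to_utc(local: datetime) -> datetime:
    """Interpret a naive datetime as Central time and convert it to UTC."""
    return local.replace(tzinfo=CST).astimezone(timezone.utc)


def utc_to_central(utc: datetime) -> datetime:
    """Interpret a naive datetime as UTC and convert it to Central time."""
    return utc.replace(tzinfo=timezone.utc).astimezone(CST)


# Dates below are assumed for illustration; only the times come from the thread.
last_seen_utc = central_to_utc(datetime(2021, 12, 22, 12, 15))
lb_drop_central = utc_to_central(datetime(2021, 12, 23, 2, 15))

print("Last Seen 12:15 Central ->", last_seen_utc.strftime("%Y-%m-%d %H:%M UTC"))
print("LB drop-off 02:15 UTC   ->", lb_drop_central.strftime("%Y-%m-%d %H:%M CST"))
```

(Under these assumptions the "Last Seen" reading and the load balancer drop-off are several hours apart, which matches the inconsistency noted above.)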

John Kornfeld

12/23/2021, 1:09 AM

(Stepping away for ~30 mins - thanks again for the help)

zwass

12/23/2021, 1:19 AM

Confirmed 2:15 AM UTC was the update time 😕

John Kornfeld

12/23/2021, 1:58 AM

😞 well at a minimum it's good to have root cause! And re-installing agents wouldn't be the end of the world for us

John Kornfeld

12/23/2021, 1:59 AM

Sadly I don't know Go, but if there's any way I can help test/investigate fixes etc, I'm happy to help

zwass

12/23/2021, 3:00 AM

When you get access to one of those machines, can you please go into "Services", right-click the `Orbit osquery`/`Fleet osquery` service and open "Properties", then copy the command line and run it in an admin PowerShell? The logs that generates can help us understand what's going on.

👍 1

zwass

12/23/2021, 3:01 AM

Windows Services just swallows all of the logs generated by services, so we still need to create a custom logging mechanism for Windows.

John Kornfeld

12/23/2021, 3:44 AM

Yup, happy to - might not be until 12/26 or later, but I can share it when I'm able

zwass

01/08/2022, 5:21 PM

Were you able to get another look at these? Would be curious to know what you found.
