# fleet
a
Hi all, I’m trying to debug a situation where I think hosts are checking in with our Fleet server and sending responses to additional queries when they “should” be offline. Ex: We have a simple query
SELECT * FROM windows_security_center
that we put in our additional queries for debugging. We have a host that returns "Good" for all the columns when live queried, but responses from the
/api/v1/fleet/hosts
endpoint will have the same host and same query with this:
{
  "firewall": "Error",
  "antivirus": "Error",
  "autoupdate": "Good",
  "internet_settings": "Error",
  "user_account_control": "Error",
  "windows_security_center_service": "Error"
}
and the host with
"status": "offline",
It seems quite similar to this Slack post from April 2024, so possibly something about startup/shutdown timing?
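For reference, we're pulling those results with something along these lines (token and base URL are placeholders, and the jq path is just a guess at a tidy way to show the relevant fields):
curl -s -H "Authorization: Bearer $FLEET_API_TOKEN" \
  "https://fleet.example.com/api/v1/fleet/hosts" \
  | jq '.hosts[] | {hostname, status, additional}'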
k
Hi @Andrew Zick! Host detail queries are sent on an hourly basis by default, so it's entirely possible to get slightly different results from a live query. That being said, for something like this it should be a fairly short-lived state that I wouldn't expect to change a whole lot over the course of a day. A few questions:
You say that the hosts should be offline… do you mean that the hosts are showing up as "offline" in Fleet, but responding to the Live query?
Does refetching the host update the value in the API for your additional query?
Are you seeing any errors in the Fleet server logs around query ingestion (or in general)?
a
> do you mean that the hosts are showing up as "offline" in Fleet, but responding to the Live query?
No sorry that’s poor phrasing on my part. I meant that I wouldn’t expect these hosts to have different responses saved in the
/hosts
endpoint when offline vs. online. This could be a misunderstanding of Fleet server behavior on my part. Is the
/hosts
endpoint response simply a summary of what Fleet has in its database at the time, and the database is updated via the Host detail queries that you mentioned?
> Does refetching the host update the value in the API for your additional query?
The host is offline so I can’t refetch it via the UI, and it doesn’t seem like the
/hosts/:identifier/refetch
sets
refetch_requested
to true either (when the host is offline)? But again I described the situation poorly so I bet this is expected.
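For what it's worth, this is roughly how I'm hitting the refetch endpoint and then checking the flag (host ID, token, and URL are placeholders):
curl -s -X POST -H "Authorization: Bearer $FLEET_API_TOKEN" \
  "https://fleet.example.com/api/v1/fleet/hosts/1234/refetch"
curl -s -H "Authorization: Bearer $FLEET_API_TOKEN" \
  "https://fleet.example.com/api/v1/fleet/hosts/1234" | jq '.host.refetch_requested'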
> Are you seeing any errors in the Fleet server logs around query ingestion (or in general)?
To be honest, I don't think I've ever looked at our Fleet server logs. Are those one of these three kinds of logs? Or the underlying MySQL db logs?
k
> This could be a misunderstanding of Fleet server behavior on my part. Is the
/hosts
endpoint response simply a summary of what Fleet has in its database at the time, and the database is updated via the Host detail queries that you mentioned?
This might be the key factor. Any data in the Fleet UI and API is updated on a set interval while the host is online, so you're seeing the last known state of the host when you fetch things from the API.
The exception is running a live query: those results are fetched fresh from the host, assuming that it is online.
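If you want to compare the two directly for that host, something like this should do it (host name is a placeholder):
fleetctl query --hosts some-windows-host --query "SELECT * FROM windows_security_center"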
a
> so you're seeing the last known state of the host when you fetch things from the API
Hm, okay, so possibly what’s happening is that the host is sending details after the user has logged out, so Windows no longer has access to the information that
windows_security_center
relies on?
u
That's entirely possible if the machine is still powered up with an internet connection.
a
Hmmm. Thank you for your help! I wonder how other people have solved this problem of checking for Firewall status, then. Maybe we’ll have to maintain state outside the Fleet server and ignore results when there are zero
logged_in_users
. Or I've seen other queries that check registry entries for the Firewall being enabled, but I figured those would suffer from the same not-logged-in → no-access-to-HKEYs problem 🤔
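e.g. maybe gating the query itself on logged_in_users, something like this (totally untested sketch; host name is a placeholder, and I'm just wrapping it in a live query to illustrate):
fleetctl query --hosts some-windows-host --query \
  "SELECT wsc.* FROM windows_security_center wsc WHERE (SELECT COUNT(*) FROM logged_in_users) > 0"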
u
I still think it's worth checking out the Fleet server logs (the service logs from the running Fleet instance) and the osquery logs to see if there are any errors there before we assume that's the actual scenario we're running into here.
a
For a Fleet server running on EKS, where would those logs be by default? Or is there no default and I need to configure a destination in the Fleet server config?
k
I believe the default there is each node's
/var/log/pods
if you don't have CloudWatch or Container Insights set up.
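The quickest way to eyeball them is probably straight from kubectl (namespace and deployment name are guesses on my part):
kubectl logs -n fleet deployment/fleet --since=24h | grep -iE "error|warn"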
You can also pull a debug archive with
fleetctl debug archive
. That contains some aggregated logs pulled from Redis. We might not get the complete picture, but it's a good start.
I'm also happy to take a look at that archive if you'd like to send it to me in a DM.
Slack won't like the tar format, but just wrapping it in a zip archive works well.
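Something like:
zip fleet-debug.zip fleet-profiles-archive-*.tar.gz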
a
fleetctl debug archive
command output:
secureframe-cdk git:(master) ✗ fleetctl debug archive                          
Warning: Version mismatch.
Client Version:   4.51.1
Server Version:  0.0.0-SNAPSHOT-b31e25a
Ran allocs
Ran block
Ran cmdline
Failed errors: get errors received status 500
Ran goroutine
Ran heap
Ran mutex
Ran profile
Ran threadcreate
Ran trace
Ran db-locks
Ran db-innodb-status
Ran db-process-list
################################################################################
# WARNING:
#   The files in the generated archive may contain sensitive data.
#   Please review them before sharing.
#
#   Archive written to: fleet-profiles-archive-20250130112845Z.tar.gz
################################################################################
Don’t worry about the client mismatch, that’s on purpose. I’m worried about the
Failed errors: get errors received status 500
line 😬 possibly meaning there are no error logs included… but still DMing the zip file right now.
k
We sometimes see that when there's a large number of errors in the store. Taking a look at what you've shared now!
u
Do you have a lot of VMs in your environment? And is this host one of them?
k
I'm seeing patterns in the logs that indicate that there are hosts that share a hardware UUID, which is the default identifier used by both
osquery
and
fleetd
for host enrollment. That usually points to VMs that are either pulling the UUID from their host machine or are hardcoded with the same UUID.
Here's a quick breakdown on how that works.
1. The first time a host checks in to Fleet, it sends over an enrollment request, including its osquery identifier. By default, this is 'UUID'.
2. Fleet checks to see if it already has a host on file with that identifier.
   a. If it does, this host is associated with the existing one in Fleet and we see this in the logs:
      i. level=warn msg="osquery host with duplicate identifier has enrolled in Fleet and will overwrite existing host data" identifier=<uuid> host_id=<id>
   b. If it doesn't, a new host is created.
3. Fleet generates a node key to be used to authenticate requests, updates the Fleet host with that node key, and sends it back to osquery.
4. Every time osquery sends a request to Fleet, it includes the node key for authentication.
   a. If Fleet finds that node key, it accepts the request and uses it to identify the host.
   b. If Fleet doesn't find that key, it sends an error message and logs the invalid node key.
      i. We see this in the server logs:
         1. level=info err="authentication error: invalid node key: <node key>"
      ii. The host enrolls again.
The fix for that is to build a new package that uses
instance
as the host identifier. This is a unique UUID associated with the actual osquery database present on the host.
fleetctl package --host-identifier instance [...otherFlags]
IF this host is one of the affected hosts, the data in the detail query may actually be from an entirely different machine than the one responding to live queries.
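Once that host is back online, grepping the server logs for those two messages should show whether it's affected (same namespace/deployment guesses as before):
kubectl logs -n fleet deployment/fleet --since=168h | grep -E "duplicate identifier|invalid node key"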
a
> Do you have a lot of VMs in your environment? And is this host one of them?
I don't believe this host is a VM, based on the customer that they're a part of. We definitely have a handful of VMs (often EC2 instances that people install the agent on), but they're a minority of our ~10,000 hosts.
Thank you so much for that breakdown of the handshake; I will be adding that explanation to our internal docs (with a link). About
--host-identifier
, we do something kinda funky. Back in ~2022 when we first set up Fleet, the original dev added an option to
fleetctl package
to allow for passing in a specific “HostId” which is then passed into Orbit’s startup command, like so:
ExecStart=/opt/orbit/bin/orbit/orbit {{ if .HostId }} -- --host_identifier specified --specified_identifier {{ .HostId }} {{ end }}
"HostId" could be duplicated if a customer tries to install the same Agent package on multiple hosts, so some amount of those errors being present is plausible and expected. But combining that with what you're saying, I wonder if the first time Orbit starts, it's not starting with our custom HostId but instead the default UUID? I say this because our specified identifiers are three UUIDs concatenated with ":" separating each one, so a single UUID seems wrong.
We've had a long-running issue with "ghost devices" where two hosts will share the same serial number + UUID, but one never checks in and is missing most info. I'll attach a screenshot as an example. However, this is beyond the original scope of this thread/question, and the original host in question does not have a ghost host associated with it.
I don't see the original host's UUID in the Fleet webserver logs, but it hasn't checked in for the last 9 days, so this isn't surprising. I think my next step is to monitor the original host and check the webserver logs when it next comes online?
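And maybe when it does come back online, check which identifier flags osquery is actually running with, something like (host name is a placeholder):
fleetctl query --hosts the-original-host --query \
  "SELECT name, value FROM osquery_flags WHERE name IN ('host_identifier', 'specified_identifier')"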
u
That would definitely be the case. Orbit only supports 'hostname' and 'instance' as identifiers.