# fleet
a
Hi all, I’m trying to debug a situation where I think hosts are checking in with our Fleet server and sending responses to additional queries when they “should” be offline. Ex: We have a simple query
SELECT * FROM windows_security_center
that we put in our additional queries for debugging. We have a host that returns "Good" for all the columns when live queried, but responses from the
/api/v1/fleet/hosts
endpoint will have the same host and same query with this:
{
  "firewall": "Error",
  "antivirus": "Error",
  "autoupdate": "Good",
  "internet_settings": "Error",
  "user_account_control": "Error",
  "windows_security_center_service": "Error"
}
and the host with
"status": "offline",
It seems quite similar to this Slack post from April 2024, so possibly something about startup/shutdown timing?
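For reference, we're pulling those results with something along these lines (token and base URL are placeholders, and the jq path is just a guess at a tidy way to show the relevant fields):
curl -s -H "Authorization: Bearer $FLEET_API_TOKEN" \
  "https://fleet.example.com/api/v1/fleet/hosts" \
  | jq '.hosts[] | {hostname, status, additional}'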
k
Hi @Andrew Zick! Host detail queries are sent on an hourly basis by default, so it's entirely possible to get slightly different results from a live query. That being said, for something like this it should be a fairly short-lived state that I wouldn't expect to change a whole lot over the course of a day. A few questions:
You say that the hosts should be offline… do you mean that the hosts are showing up as "offline" in Fleet, but responding to the Live query?
Does refetching the host update the value in the API for your additional query?
Are you seeing any errors in the Fleet server logs around query ingestion (or in general)?
a
> do you mean that the hosts are showing up as "offline" in Fleet, but responding to the Live query?
No sorry that’s poor phrasing on my part. I meant that I wouldn’t expect these hosts to have different responses saved in the
/hosts
endpoint when offline vs. online. This could be a misunderstanding of Fleet server behavior on my part. Is the
/hosts
endpoint response simply a summary of what Fleet has in its database at the time, and the database is updated via the Host detail queries that you mentioned?
> Does refetching the host update the value in the API for your additional query?
The host is offline so I can’t refetch it via the UI, and it doesn’t seem like the
/hosts/:identifier/refetch
sets
refetch_requested
to true either (when the host is offline)? But again I described the situation poorly so I bet this is expected.
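For what it's worth, this is roughly how I'm hitting the refetch endpoint and then checking the flag (host ID, token, and URL are placeholders):
curl -s -X POST -H "Authorization: Bearer $FLEET_API_TOKEN" \
  "https://fleet.example.com/api/v1/fleet/hosts/1234/refetch"
curl -s -H "Authorization: Bearer $FLEET_API_TOKEN" \
  "https://fleet.example.com/api/v1/fleet/hosts/1234" | jq '.host.refetch_requested'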
> Are you seeing any errors in the Fleet server logs around query ingestion (or in general)?
To be honest, I don't think I've ever looked at our Fleet server logs. Are those one of these three kinds of logs? Or the underlying MySQL db logs?
k
> This could be a misunderstanding of Fleet server behavior on my part. Is the
/hosts
endpoint response simply a summary of what Fleet has in its database at the time, and the database is updated via the Host detail queries that you mentioned?
This might be the key factor. Any data in the Fleet UI and API is updated on a set interval while the host is online, so you're seeing the last known state of the host when you fetch things from the API.
The exception is running a live query: those results are fetched fresh from the host, assuming that it is online.
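If you want to compare the two directly for that host, something like this should do it (host name is a placeholder):
fleetctl query --hosts some-windows-host --query "SELECT * FROM windows_security_center"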
a
> so you're seeing the last known state of the host when you fetch things from the API
Hm, okay, so possibly what’s happening is that the host is sending details after the user has logged out, so Windows no longer has access to the information that
windows_security_center
relies on?
u
That's entirely possible if the machine is still powered up with an internet connection.
a
Hmmm. Thank you for your help! I wonder how other people have solved this problem of checking for Firewall status, then. Maybe we’ll have to maintain state outside the Fleet server and ignore results when there are zero
logged_in_users
. Or I've seen other queries that check registry entries for the Firewall being enabled, but I figured those would suffer from the same not-logged-in → no-access-to-HKEYs problem 🤔
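e.g. maybe gating the query itself on logged_in_users, something like this (totally untested sketch; host name is a placeholder, and I'm just wrapping it in a live query to illustrate):
fleetctl query --hosts some-windows-host --query \
  "SELECT wsc.* FROM windows_security_center wsc WHERE (SELECT COUNT(*) FROM logged_in_users) > 0"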
u
I still think it's worth checking out the Fleet server logs (the service logs from the running Fleet instance) and the osquery logs to see if there are any errors there before we assume that's the actual scenario we're running into here.
a
For a Fleet server running on EKS, where would those logs be by default? Or is there no default and I need to configure a destination in the Fleet server config?
k
I believe the default there is each node's
/var/log/pods
if you don't have CloudWatch or Container Insights set up.
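The quickest way to eyeball them is probably straight from kubectl (namespace and deployment name are guesses on my part):
kubectl logs -n fleet deployment/fleet --since=24h | grep -iE "error|warn"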
You can also pull a debug archive with
fleetctl debug archive
. That contains some aggregated logs pulled from Redis. We might not get the complete picture, but it's a good start.
I'm also happy to take a look at that archive if you'd like to send it to me in a DM.
Slack won't like the tar format, but just wrapping it in a zip archive works well.
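Something like:
zip fleet-debug.zip fleet-profiles-archive-*.tar.gz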
a
fleetctl debug archive
command output:
secureframe-cdk git:(master) ✗ fleetctl debug archive                          
Warning: Version mismatch.
Client Version:   4.51.1
Server Version:  0.0.0-SNAPSHOT-b31e25a
Ran allocs
Ran block
Ran cmdline
Failed errors: get errors received status 500
Ran goroutine
Ran heap
Ran mutex
Ran profile
Ran threadcreate
Ran trace
Ran db-locks
Ran db-innodb-status
Ran db-process-list
################################################################################
# WARNING:
#   The files in the generated archive may contain sensitive data.
#   Please review them before sharing.
#
#   Archive written to: fleet-profiles-archive-20250130112845Z.tar.gz
################################################################################
Don’t worry about the client mismatch, that’s on purpose. I’m worried about the
Failed errors: get errors received status 500
line 😬 possibly meaning there are no error logs included… but still DMing the zip file right now.
k
We sometimes see that when there's a large number of errors in the store. Taking a look at what you've shared now!
u
Do you have a lot of VMs in your environment? And is this host one of them?
k
I'm seeing patterns in the logs that indicate that there are hosts that share a hardware UUID, which is the default identifier used by both
osquery
and
fleetd
for host enrollment. That usually points to VMs that are either pulling the UUID from their host machine or are hardcoded with the same UUID.
Here's a quick breakdown on how that works.
1. The first time a host checks in to Fleet, it sends over an enrollment request, including its osquery identifier. By default, this is 'UUID'.
2. Fleet checks to see if it already has a host on file with that identifier.
   a. If it does, this host is associated with the existing one in Fleet and we see this in the logs:
      i. level=warn msg="osquery host with duplicate identifier has enrolled in Fleet and will overwrite existing host data" identifier=<uuid> host_id=<id>
   b. If it doesn't, a new host is created.
3. Fleet generates a node key to be used to authenticate requests, updates the Fleet host with that node key, and sends it back to osquery.
4. Every time osquery sends a request to Fleet, it includes the node key for authentication.
   a. If Fleet finds that node key, it accepts the request and uses it to identify the host.
   b. If Fleet doesn't find that key, it sends an error message and logs the invalid node key.
      i. We see this in the server logs:
         1. level=info err="authentication error: invalid node key: <node key>"
      ii. The host enrolls again.
The fix for that is to build a new package that uses
instance
as the host identifier. This is a unique UUID associated with the actual osquery database present on the host.
fleetctl package --host-identifier instance [...otherFlags]
IF this host is one of the affected hosts, the data in the detail query may actually be from an entirely different machine than the one responding to live queries.
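Once that host is back online, grepping the server logs for those two messages should show whether it's affected (same namespace/deployment guesses as before):
kubectl logs -n fleet deployment/fleet --since=168h | grep -E "duplicate identifier|invalid node key"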
a
> Do you have a lot of VMs in your environment? And is this host one of them?
I don't believe this host is a VM, based on the customer that they're a part of. We definitely have a handful of VMs (often EC2 instances that people install the agent on), but they're a minority of our ~10,000 hosts.
Thank you so much for that breakdown of the handshake; I will be adding that explanation to our internal docs (with a link). About
--host-identifier
, we do something kinda funky. Back in ~2022 when we first set up Fleet, the original dev added an option to
fleetctl package
to allow for passing in a specific “HostId” which is then passed into Orbit’s startup command, like so:
ExecStart=/opt/orbit/bin/orbit/orbit {{ if .HostId }} -- --host_identifier specified --specified_identifier {{ .HostId }} {{ end }}
"HostId" could be duplicated if a customer tries to install the same Agent package on multiple hosts, so some amount of those errors being present is plausible and expected. But combining that with what you're saying, I wonder if the first time Orbit starts, it's not starting with our custom HostId but instead the default UUID? I say this because our specified identifiers are three UUIDs concatenated with ":" separating each one, so a single UUID seems wrong.
We've had a long-running issue with "ghost devices" where two hosts will share the same serial number + UUID, but one never checks in and is missing most info. I'll attach a screenshot as an example. However, this is beyond the original scope of this thread/question, and the original host in question does not have a ghost host associated with it.
I don't see the original host's UUID in the Fleet webserver logs, but it hasn't checked in for the last 9 days, so this isn't surprising. I think my next step is to monitor the original host and check the webserver logs when it next comes online?
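And maybe when it does come back online, check which identifier flags osquery is actually running with, something like (host name is a placeholder):
fleetctl query --hosts the-original-host --query \
  "SELECT name, value FROM osquery_flags WHERE name IN ('host_identifier', 'specified_identifier')"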
u
That would definitely be the case. Orbit only supports 'hostname' and 'instance' as identifiers.