mysql hosts table looks like this for these hosts....
# fleet
g
mysql hosts table looks like this for these hosts. all other fields are empty
                  id: 241686
     osquery_host_id: FD723F42-2034-8947-200D-9D1902CF7058
          created_at: 2022-12-12 12:12:43
          updated_at: 2022-12-12 12:12:43
   detail_updated_at: 1970-01-02 01:00:00
            node_key: 86UbEE48jOqr3epoiDHg4nLhEwgq390m
              uptime: 0
              memory: 0
  cpu_physical_cores: 0
   cpu_logical_cores: 0
       primary_ip_id: NULL
distributed_interval: 0
   logger_tls_period: 0
  config_tls_refresh: 0
    label_updated_at: 1970-01-02 01:00:00
    last_enrolled_at: 1970-01-02 01:00:00
   refetch_requested: 1
             team_id: NULL
   policy_updated_at: 1970-01-02 01:00:00
      orbit_node_key: 86UbEE48jOqr3epoiDHg4nLhEwgq390m
m
Hmm, this definitely shouldn’t be happening. Are you using orbit or plain osquery?
g
orbit
m
I may need to check with the rest of the team and get back to you.
Do you have host expiry enabled? I think these hosts will eventually be cleaned up, but it’s still an issue we should look into
To check, go to Settings -> Organization Settings -> Advanced options and see if the checkbox for Host expiry is checked
g
no i don't use the host expiration
i can just delete those from the db as well i guess
all with 1970-* as date
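A minimal sketch of how the placeholder rows could be listed (not deleted) before touching anything. It assumes the 1970-01-02 timestamps act as a "never updated" sentinel and uses an in-memory SQLite table as a stand-in for the Fleet MySQL database; `find_placeholder_hosts` and `demo_connection` are hypothetical helper names, and only the columns needed for the check are modelled.

```python
import sqlite3

# Prefix of the sentinel timestamps seen in the rows above. Treating it as
# "never updated" is an assumption made for this sketch.
SENTINEL_PREFIX = "1970-"


def find_placeholder_hosts(conn):
    """Return (id, osquery_host_id) for rows that look like empty placeholder hosts."""
    cur = conn.execute(
        "SELECT id, osquery_host_id FROM hosts "
        "WHERE last_enrolled_at LIKE ? AND detail_updated_at LIKE ? "
        "ORDER BY id",
        (SENTINEL_PREFIX + "%", SENTINEL_PREFIX + "%"),
    )
    return cur.fetchall()


def demo_connection():
    """Build an in-memory table holding two rows copied from this thread."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hosts (id INTEGER PRIMARY KEY, osquery_host_id TEXT, "
        "hostname TEXT, detail_updated_at TEXT, last_enrolled_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO hosts VALUES (?, ?, ?, ?, ?)",
        [
            (241686, "FD723F42-2034-8947-200D-9D1902CF7058", "",
             "1970-01-02 01:00:00", "1970-01-02 01:00:00"),
            (2088, "89f17842-22be-4b0e-98ac-67efc907a9ba", "MASKED-web01",
             "2022-12-13 14:56:24", "2022-03-21 09:55:53"),
        ],
    )
    return conn


if __name__ == "__main__":
    print(find_placeholder_hosts(demo_connection()))
```

Listing the rows first keeps the check read-only, so it can be run before any backup or delete decision.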
m
before you do, do you mind making a backup? we may need additional info for troubleshooting/debugging
Also, do you know if the `created_at` field is the same for all these affected hosts? Does it correspond with the time you upgraded fleet?
g
the created_at date keeps changing to the current day/time for all those hosts
I upgraded last friday
m
Is the total number of hosts increasing from last friday or staying constant? The created_at field should never be updated as far as I know
g
looks like hosts are being removed and new ones are being added. the host with id "241686" from my original post does not exist anymore, but a new host was added (not a real one)
m
Are there any errors in the logs, potentially containing `/api/fleet/orbit/enroll`?
g
there are errors, but the same errors as before the upgrade to 4.24.1, and nothing with enroll in the uri
m
One more thing before I take this back to the team. Can you do a
select * from hosts where osquery_host_id = "FD723F42-2034-8947-200D-9D1902CF7058"
g
                  id: 243411
     osquery_host_id: FD723F42-2034-8947-200D-9D1902CF7058
          created_at: 2022-12-12 13:12:56
          updated_at: 2022-12-12 13:12:56
   detail_updated_at: 1970-01-02 01:00:00
            node_key: +hDXpX6Z+97AtEdRIL9X1o7W3iiUOfvv
            hostname:
                uuid:
            platform:
     osquery_version:
          os_version:
               build:
       platform_like:
           code_name:
              uptime: 0
              memory: 0
            cpu_type:
         cpu_subtype:
           cpu_brand:
  cpu_physical_cores: 0
   cpu_logical_cores: 0
     hardware_vendor:
      hardware_model:
    hardware_version:
     hardware_serial:
       computer_name:
       primary_ip_id: NULL
distributed_interval: 0
   logger_tls_period: 0
  config_tls_refresh: 0
          primary_ip:
         primary_mac:
    label_updated_at: 1970-01-02 01:00:00
    last_enrolled_at: 1970-01-02 01:00:00
   refetch_requested: 1
             team_id: NULL
   policy_updated_at: 1970-01-02 01:00:00
           public_ip:
      orbit_node_key: +hDXpX6Z+97AtEdRIL9X1o7W3iiUOfvv
m
based on the ids and the created_at timestamps, it seems like there are quite a few hosts being created/deleted over a relatively short period of time. Are you experiencing any other issues, e.g. degraded performance?
g
no other issues. in the last hour, the count for such hosts went from 1724 to 1725.
we have 2551 real hosts in fleet
m
This may be a little hard for you to find out, but are you able to identify which hosts are having issues? orbit/osquery logs from one of these hosts may be helpful
g
looks like all of these hosts are being deleted and added every 2 hours
m
yes, we have a cleanup job that periodically cleans up hosts. This explains why the same hosts appear to be recreated, but with new ids
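To make the delete-and-recreate pattern visible, two snapshots of the hosts table taken some hours apart can be compared. The sketch below is only an illustration: `churned_hosts` is a hypothetical helper, and the snapshots are assumed to be lists of `(id, osquery_host_id)` tuples exported from the database.

```python
def churned_hosts(before, after):
    """Map osquery_host_id -> (old_id, new_id) for hosts whose row id changed.

    `before` and `after` are iterables of (id, osquery_host_id) tuples taken
    from the hosts table at two points in time.
    """
    old = {host_id: row_id for row_id, host_id in before}
    new = {host_id: row_id for row_id, host_id in after}
    return {
        host_id: (old[host_id], new[host_id])
        for host_id in old.keys() & new.keys()
        if old[host_id] != new[host_id]
    }


if __name__ == "__main__":
    # The ids below are the two rows posted earlier for the same osquery_host_id.
    snapshot_1 = [(241686, "FD723F42-2034-8947-200D-9D1902CF7058")]
    snapshot_2 = [(243411, "FD723F42-2034-8947-200D-9D1902CF7058")]
    print(churned_hosts(snapshot_1, snapshot_2))
```

A host that keeps its `osquery_host_id` but gets a new `id` between snapshots was removed and enrolled again in the meantime.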
g
ok. I don't have a way to check the logs on those 2551 hosts, we don't have centralized logging for the osquery/orbit logs
when deleting these hosts, they are instantly being recreated
m
okay, we will look into this further and get back to you later
> ok. I don’t have a way to check the logs on those 2551 hosts, we don’t have centralized logging for the osquery/orbit logs
Do you mean you don’t have any access at all to these hosts? The logs for even one of the affected hosts would really help here. See docs for where to find orbit logs on various platforms.
g
I have access to all hosts, but I don't know how to identify an affected host
m
Okay, there are a couple of ways I can think of, but they may require some trial and error.
1. Run `osqueryi` on a host and execute the following sql query: `select uuid from osquery_info`. See if it matches the `osquery_host_id` of an affected host in the fleet db.
2. Run the following command (osx/linux): `ps -eo pid,lstart,command | grep osquery` and see if osquery is either not running, or has a start time that is very recent, i.e. less than 1 hour ago.
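The first check can be scripted once the UUIDs have been collected from the hosts. The following is a rough sketch under a few assumptions: the osquery JSON output is an array of row objects, UUIDs should be compared case-insensitively (both upper- and lower-case values appear in this database), and `uuid_from_json` and `match_affected` are hypothetical helper names.

```python
import json


def uuid_from_json(output):
    """Extract the uuid from the JSON output of `select uuid from osquery_info`.

    Assumes the output is a JSON array of row objects; returns None if empty.
    """
    rows = json.loads(output)
    return rows[0]["uuid"] if rows else None


def match_affected(collected_uuids, placeholder_host_ids):
    """Return collected UUIDs that also appear as placeholder osquery_host_id values."""
    def normalize(value):
        # Compare case-insensitively and ignore stray whitespace/newlines.
        return value.strip().upper()

    wanted = {normalize(v) for v in placeholder_host_ids}
    return sorted({normalize(u) for u in collected_uuids} & wanted)


if __name__ == "__main__":
    collected = [uuid_from_json('[{"uuid":"59043F42-0ECC-2981-B532-AD4EEA9D2814"}]')]
    placeholders = ["59043F42-0ECC-2981-B532-AD4EEA9D2814"]
    print(match_affected(collected, placeholders))
```

Any UUID returned here belongs to a machine that also has a placeholder row, which makes it a good candidate for collecting orbit/osquery logs.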
g
I ran `osqueryd -S --json "select uuid from osquery_info"` on all of our windows hosts, and 612 uuid's from that query match with an osquery_host_id from those offline hosts in the fleet db
orbit was auto updated on these hosts on 2022-12-06 and I notice this in the logs, could this be related?
INF initial flags update failed error="error getting flags from fleet: unauthenticated, or invalid token"
m
seems likely, looking into it
did you also check if osquery was running using 2 above?
on one of the hosts you found that was affected
g
so far i've only checked the windows hosts, and the service was running since Dec 6th on most of them. need to check for linux still
m
On windows you can run the following in powershell
Get-Process osqueryd | select name,starttime
Just confirming that osqueryd is not running, and that the issue is limited to orbit
Do you see any errors in the fleet logs containing `/api/fleet/orbit/config`?
g
Both osqueryd & orbit processes are running, but this is also the case on non-problem hosts with a uuid that does not appear in the list of problematic fleetdb uuid's
yes, lots of these:
fleet[2878872]: {"component":"http","err":": Authentication required","internal":"authentication error: invalid orbit node key","level":"info","path":"/api/fleet/orbit/config","ts":"2022-12-13T13:05:18.614380037Z"}
these logs match when there's an http 401. I asked about this 2 months ago here: https://osquery.slack.com/archives/C01DXJL16D8/p1666699093295139
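To get a feel for how often this error occurs and on which endpoints, the Fleet server log can be tallied. This is a sketch that assumes each line carries a syslog-style prefix followed by one JSON object, as in the line quoted above; `parse_fleet_line` and `count_invalid_orbit_keys` are hypothetical helper names.

```python
import json
from collections import Counter


def parse_fleet_line(line):
    """Return the JSON payload of a log line as a dict, or None if there is none."""
    start = line.find("{")
    if start == -1:
        return None
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        # Not a JSON log line (or truncated); skip it.
        return None


def count_invalid_orbit_keys(lines):
    """Count 'invalid orbit node key' errors per request path."""
    counts = Counter()
    for line in lines:
        entry = parse_fleet_line(line)
        if entry and "invalid orbit node key" in entry.get("internal", ""):
            counts[entry.get("path", "")] += 1
    return counts
```

Feeding it the lines of the log file (for example via `open(path)`) gives a per-path count that can be compared before and after a fix.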
m
These two issues might be related.
Can you check if `C:\Program Files\Orbit\secret-orbit-node-key.txt` exists and is not empty?
g
yes, exists and not empty
m
Can you verify that it’s valid using the following sql query in fleet’s db
select id from hosts where orbit_node_key = "secret"
Replacing secret with the value from the above file
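The same lookup can be done without pasting the key by hand. The sketch below is an assumption-laden illustration: it reads the key file as UTF-8 text, uses an in-memory SQLite table as a stand-in for the Fleet MySQL database, and `host_ids_for_key_file` and `demo` are hypothetical names.

```python
import sqlite3
import tempfile
from pathlib import Path


def host_ids_for_key_file(conn, key_path):
    """Look up host ids whose orbit_node_key equals the contents of key_path."""
    key = Path(key_path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{key_path} is empty")
    # sqlite3 uses "?" placeholders; MySQL drivers typically use "%s" instead.
    rows = conn.execute(
        "SELECT id FROM hosts WHERE orbit_node_key = ?", (key,)
    ).fetchall()
    return [row[0] for row in rows]


def demo():
    """Run the lookup against a throwaway table and a temporary key file."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hosts (id INTEGER PRIMARY KEY, orbit_node_key TEXT)")
    conn.executemany(
        "INSERT INTO hosts VALUES (?, ?)",
        [(2088, None), (290414, "0onOi8pacBhK/lxzUd1NKLRd4JBP02Kf")],
    )
    with tempfile.TemporaryDirectory() as tmp:
        key_file = Path(tmp) / "secret-orbit-node-key.txt"
        key_file.write_text("0onOi8pacBhK/lxzUd1NKLRd4JBP02Kf\n", encoding="utf-8")
        return host_ids_for_key_file(conn, key_file)


if __name__ == "__main__":
    print(demo())
```

Passing the key as a query parameter avoids quoting problems with characters such as `+` and `/` that appear in the node keys.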
g
and do this for all hosts with a matching uuid in the previous step probably?
m
let’s limit it to 1 or 2 hosts for now
g
When I run that orbit_node_key query, I get one of those problem hosts
And the orbit_node_key for that real host is NULL in the fleetdb
m
can you run `select * from hosts` for the problem host and the real host and paste the output?
g
*************************** 1. row ***************************
                  id: 2088
     osquery_host_id: 89f17842-22be-4b0e-98ac-67efc907a9ba
          created_at: 2022-03-21 09:55:53
          updated_at: 2022-12-13 14:56:24
   detail_updated_at: 2022-12-13 14:56:24
            node_key: 4gEbEsUgRIX5yTYrQ7hLhplDbdpOT87U
            hostname: MASKED-web01
                uuid: 59043F42-0ECC-2981-B532-AD4EEA9D2814
            platform: windows
     osquery_version: 5.6.0
          os_version: Windows Server 2012 R2 Standard
               build: 9600
       platform_like: windows
           code_name: Microsoft Windows Server 2012 R2 Standard
              uptime: 2896149000000000
              memory: 4294967296
            cpu_type: x86_64
         cpu_subtype: -1
           cpu_brand: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
  cpu_physical_cores: 2
   cpu_logical_cores: 2
     hardware_vendor: VMware, Inc.
      hardware_model: VMware Virtual Platform
    hardware_version: -1
     hardware_serial: VMware-42 3f 04 59 cc 0e 81 29-b5 32 ad 4e ea 9d 28 14
       computer_name: MASKED-WEB01
       primary_ip_id: NULL
distributed_interval: 10
   logger_tls_period: 10
  config_tls_refresh: 60
          primary_ip: MASKED
         primary_mac: 00:50:56:bf:b6:20
    label_updated_at: 2022-12-13 14:55:34
    last_enrolled_at: 2022-03-21 09:55:53
   refetch_requested: 0
             team_id: NULL
   policy_updated_at: 2022-12-13 05:40:55
           public_ip: MASKED
      orbit_node_key: NULL


*************************** 1. row ***************************
                  id: 290414
     osquery_host_id: 59043F42-0ECC-2981-B532-AD4EEA9D2814
          created_at: 2022-12-13 14:08:47
          updated_at: 2022-12-13 14:08:47
   detail_updated_at: 1970-01-02 01:00:00
            node_key: 0onOi8pacBhK/lxzUd1NKLRd4JBP02Kf
            hostname:
                uuid:
            platform:
     osquery_version:
          os_version:
               build:
       platform_like:
           code_name:
              uptime: 0
              memory: 0
            cpu_type:
         cpu_subtype:
           cpu_brand:
  cpu_physical_cores: 0
   cpu_logical_cores: 0
     hardware_vendor:
      hardware_model:
    hardware_version:
     hardware_serial:
       computer_name:
       primary_ip_id: NULL
distributed_interval: 0
   logger_tls_period: 0
  config_tls_refresh: 0
          primary_ip:
         primary_mac:
    label_updated_at: 1970-01-02 01:00:00
    last_enrolled_at: 1970-01-02 01:00:00
   refetch_requested: 1
             team_id: NULL
   policy_updated_at: 1970-01-02 01:00:00
           public_ip:
      orbit_node_key: 0onOi8pacBhK/lxzUd1NKLRd4JBP02Kf
m
> And the orbit_node_key for that real host is NULL in the fleetdb
just to clarify, is the above output from the same host? How do you know it’s the same host?
g
Because the orbit_node_key from the host with id 290414 matches the secret in the secret-orbit-node-key.txt file, but all the other host details match with host id 2088
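The pattern described here (a real row without an orbit node key whose `uuid` equals the `osquery_host_id` of an empty row that holds the orbit node key) can be searched for with a self-join. This is a sketch only, using an in-memory SQLite table in place of the Fleet MySQL database; the aliases `e` and `p`, and the helpers `find_split_pairs` and `demo_connection`, are hypothetical.

```python
import sqlite3

# e = existing (real) host row, p = placeholder row created by orbit enrollment.
PAIR_QUERY = """
SELECT e.id, p.id, e.uuid
FROM hosts AS e
JOIN hosts AS p
  ON UPPER(p.osquery_host_id) = UPPER(e.uuid)
 AND p.id <> e.id
WHERE e.orbit_node_key IS NULL
  AND p.orbit_node_key IS NOT NULL
  AND e.uuid <> ''
ORDER BY e.id
"""


def find_split_pairs(conn):
    """Return (real_id, placeholder_id, uuid) for hosts split across two rows."""
    return conn.execute(PAIR_QUERY).fetchall()


def demo_connection():
    """Build a table with the two rows pasted above (relevant columns only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hosts (id INTEGER PRIMARY KEY, osquery_host_id TEXT, "
        "uuid TEXT, orbit_node_key TEXT)"
    )
    conn.executemany(
        "INSERT INTO hosts VALUES (?, ?, ?, ?)",
        [
            (2088, "89f17842-22be-4b0e-98ac-67efc907a9ba",
             "59043F42-0ECC-2981-B532-AD4EEA9D2814", None),
            (290414, "59043F42-0ECC-2981-B532-AD4EEA9D2814", "",
             "0onOi8pacBhK/lxzUd1NKLRd4JBP02Kf"),
        ],
    )
    return conn


if __name__ == "__main__":
    print(find_split_pairs(demo_connection()))
```

Each returned pair is one machine represented twice: once by its original osquery enrollment and once by the orbit enrollment.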
m
Did you by any chance change (i.e. add or remove) any enroll secrets in fleet?
g
no, only 1 enroll secret and it hasn't changed since we deployed fleet
r
Hello all, after installing the osquery agent on node machines, I see this log message and can't see the RAM data: Error system_info.cpp:73] Got error trying to determine the physically installed memory: SMBIOS data is malformed
m
@Raghavendra Hiremath is this related to the issue being discussed in this thread? If not, would you mind messaging in the #fleet channel?
Can you run the following osquery query on the affected host
select * from osquery_flags where name = 'host_identifier'
g
MASKED-WEB01,"hostname","Field used to identify the host running osquery (hostname, uuid, instance, ephemeral, specified)","host_identifier","0","string","uuid"
m
One thing I noticed was that the `osquery_host_id` and the `uuid` for the real host don’t match. Trying to figure out why that is.
Can you go to the Fleet UI -> Settings -> Organization Settings -> Agent Options and see if there is anything set for the `command_line_flags`?
g
no we haven't used this
command_line_flags: {} # requires Fleet's osquery installer
m
do you mind pasting the whole yaml actually
g
config:
  options:
    logger_plugin: tls
    disable_carver: true
    disable_tables: 'chrome_extensions,firefox_addons'
    logger_tls_period: 10
    distributed_plugin: tls
    disable_distributed: false
    logger_tls_endpoint: /api/osquery/log
    distributed_interval: 10
    carver_disable_function: true
    distributed_tls_max_attempts: 3
command_line_flags: {} # requires Fleet's osquery installer
m
On the affected host, can you run the following in powershell (may require admin)
Get-CimInstance Win32_process -Filter "name ='osqueryd.exe'" | select CommandLine
g
CommandLine : "C:\Program Files\Orbit\bin\osqueryd\windows\stable\osqueryd.exe" "--pidfile=C:\Program Files\Orbit\osque
              ry.pid" "--database_path=C:\Program Files\Orbit\osquery.db" --extensions_socket=\\.\pipe\orbit-osquery-ex
              tension "--logger_path=C:\Program Files\Orbit\osquery_log" --enroll_secret_env ENROLL_SECRET --host_ident
              ifier=uuid --tls_hostname=fleet.x-ops.net --enroll_tls_endpoint=/api/v1/osquery/enroll --config_plu
              gin=tls --config_tls_endpoint=/api/v1/osquery/config --config_refresh=60 --disable_distributed=false --di
              stributed_plugin=tls --distributed_tls_max_attempts=10 --distributed_tls_read_endpoint=/api/v1/osquery/di
              stributed/read --distributed_tls_write_endpoint=/api/v1/osquery/distributed/write --logger_plugin=tls,fil
              esystem --logger_tls_endpoint=/api/v1/osquery/log --disable_carver=false --carver_disable_function=false
              --carver_start_endpoint=/api/v1/osquery/carve/begin --carver_continue_endpoint=/api/v1/osquery/carve/bloc
              k --carver_block_size=2000000 --tls_server_certs "C:\Program Files\Orbit\certs.pem" --force --flagfile "C
              :\Program Files\Orbit\osquery.flags"

CommandLine : "C:\Program Files\Orbit\bin\osqueryd\windows\stable\osqueryd.exe" "--pidfile=C:\Program Files\Orbit\osque
              ry.pid" "--database_path=C:\Program Files\Orbit\osquery.db" --extensions_socket=\\.\pipe\orbit-osquery-ex
              tension "--logger_path=C:\Program Files\Orbit\osquery_log" --enroll_secret_env ENROLL_SECRET --host_ident
              ifier=uuid --tls_hostname=fleet.x-ops.net --enroll_tls_endpoint=/api/v1/osquery/enroll --config_plu
              gin=tls --config_tls_endpoint=/api/v1/osquery/config --config_refresh=60 --disable_distributed=false --di
              stributed_plugin=tls --distributed_tls_max_attempts=10 --distributed_tls_read_endpoint=/api/v1/osquery/di
              stributed/read --distributed_tls_write_endpoint=/api/v1/osquery/distributed/write --logger_plugin=tls,fil
              esystem --logger_tls_endpoint=/api/v1/osquery/log --disable_carver=false --carver_disable_function=false
              --carver_start_endpoint=/api/v1/osquery/carve/begin --carver_continue_endpoint=/api/v1/osquery/carve/bloc
              k --carver_block_size=2000000 --tls_server_certs "C:\Program Files\Orbit\certs.pem" --force --flagfile "C
              :\Program Files\Orbit\osquery.flags"
m
What are the contents of `C:\Program Files\Orbit\osquery.flags`?
g
empty file
m
Okay, I think I have something we can try. Delete the host with id `2088` using the Fleet UI. This should trigger osquery running on the host to do a reenrollment. I still haven’t figured out why there is a discrepancy with the osquery host id, but this should hopefully resolve the issue. We can confirm it’s resolved by waiting ~1 hr and running the following query on the fleet db
select * from hosts where osquery_host_id = '59043F42-0ECC-2981-B532-AD4EEA9D2814' or uuid = '59043F42-0ECC-2981-B532-AD4EEA9D2814'
We should only get 1 row.
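When the same check has to be repeated for many hosts, it can be wrapped in a small loop. A minimal sketch, assuming an in-memory SQLite table as a stand-in for the Fleet MySQL database; `still_duplicated` and `demo` are hypothetical names, and the query mirrors the one above.

```python
import sqlite3


def still_duplicated(conn, identifiers):
    """Return identifiers that still match more than one hosts row."""
    leftover = []
    for ident in identifiers:
        # Note: "=" is case-sensitive in SQLite; MySQL's default collations are not.
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM hosts WHERE osquery_host_id = ? OR uuid = ?",
            (ident, ident),
        ).fetchone()
        if count > 1:
            leftover.append(ident)
    return leftover


def demo():
    """Show the check before and after the stale row (id 2088) is removed."""
    ident = "59043F42-0ECC-2981-B532-AD4EEA9D2814"
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hosts (id INTEGER PRIMARY KEY, osquery_host_id TEXT, uuid TEXT)"
    )
    conn.executemany(
        "INSERT INTO hosts VALUES (?, ?, ?)",
        [(2088, "89f17842-22be-4b0e-98ac-67efc907a9ba", ident), (290414, ident, "")],
    )
    before = still_duplicated(conn, [ident])
    conn.execute("DELETE FROM hosts WHERE id = 2088")  # demo table only
    after = still_duplicated(conn, [ident])
    return before, after


if __name__ == "__main__":
    print(demo())
```

An empty result after the waiting period suggests every checked host is back to a single row.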
If this works, we may need to do the same for the rest of the affected hosts. I will likely need to create a github issue to follow up. There may be a bug with orbit enrollment and osquery enrollment
g
Yes, this works. Only 1 host was enrolled after deleting it
m
are you using orbit on all of your hosts? I’m trying to think of a way that we can do the same for all the affected hosts easily.
g
Yes, using Orbit with the generated deb / msi package
m
okay, give me a few minutes to come up with a sql query to delete the other hosts and related data. Can you confirm what version of fleet you are on?
g
fleet_v4.24.1_linux
m
are you using fleet premium ie are you using teams?
g
no
m
okay, I wrote a sql script that should delete the remaining affected hosts. This assumes that all hosts are using orbit, and that you aren’t using teams. I would advise making a backup of your db and stopping fleet before running it. After starting fleet again, all affected hosts will reenroll.
I also created a github issue https://github.com/fleetdm/fleet/issues/9033. The repro steps are probably different than the way this bug happened for you, but I suspect the root cause is similar. If this problem happens again, please feel free to comment on the issue, or mention it in slack.
g
looks good, thanks. there are still about 10 such hosts in the database that re-appear after being deleted, and since running the query there are about 20 hosts that are online but can't be fetched anymore. I'll look at fixing these tomorrow