mysql hosts table looks like this for these hosts all other osquery #fleet

Gregory Storme

12/12/2022, 11:41 AM

mysql hosts table looks like this for these hosts. all other fields are empty

                  id: 241686
     osquery_host_id: FD723F42-2034-8947-200D-9D1902CF7058
          created_at: 2022-12-12 12:12:43
          updated_at: 2022-12-12 12:12:43
   detail_updated_at: 1970-01-02 01:00:00
            node_key: 86UbEE48jOqr3epoiDHg4nLhEwgq390m
              uptime: 0
              memory: 0
  cpu_physical_cores: 0
   cpu_logical_cores: 0
       primary_ip_id: NULL
distributed_interval: 0
   logger_tls_period: 0
  config_tls_refresh: 0
    label_updated_at: 1970-01-02 01:00:00
    last_enrolled_at: 1970-01-02 01:00:00
   refetch_requested: 1
             team_id: NULL
   policy_updated_at: 1970-01-02 01:00:00
      orbit_node_key: 86UbEE48jOqr3epoiDHg4nLhEwgq390m

Michal Nicpon

12/12/2022, 12:03 PM

Hmm, this definitely shouldn’t be happening. Are you using orbit or plain osquery?

Gregory Storme

12/12/2022, 12:04 PM

orbit

Michal Nicpon

12/12/2022, 12:05 PM

I may need to check with the rest of the team and get back to you.

Michal Nicpon

12/12/2022, 12:06 PM

Do you have host expiry enabled? I think these hosts will eventually be cleaned up, but it’s still an issue we should look into

Michal Nicpon

12/12/2022, 12:07 PM

To check, go to Settings -> Organization Settings -> Advanced options and see if the checkbox for Host expiry is checked

Gregory Storme

12/12/2022, 12:07 PM

no i don't use the host expiration

Gregory Storme

12/12/2022, 12:07 PM

i can just delete those from the db as well i guess

Gregory Storme

12/12/2022, 12:08 PM

all with 1970-* as date

Michal Nicpon

12/12/2022, 12:09 PM

before you do, do you mind making a backup? we may need additional info for troubleshooting/debugging

Michal Nicpon

12/12/2022, 12:09 PM

Also, do you know if the created_at field is the same for all these affected hosts? Does it correspond with the time you upgraded fleet?

Gregory Storme

12/12/2022, 12:11 PM

the created_at date keeps changing to the current day/time for all those hosts

Gregory Storme

12/12/2022, 12:11 PM

I upgraded last friday

Michal Nicpon

12/12/2022, 12:13 PM

Is the total number of hosts increasing from last friday or staying constant? The created_at field should never be updated as far as I know

Gregory Storme

12/12/2022, 12:15 PM

looks like hosts are being removed and new ones are being added. the host with id "241686" from my original post does not exist anymore, but a new host was added (not a real one)

Michal Nicpon

12/12/2022, 12:23 PM

Are there any errors in the logs, potentially containing /api/fleet/orbit/enroll?

Gregory Storme

12/12/2022, 12:34 PM

there are errors, but the same errors as before the upgrade to 4.24.1, and nothing with enroll in the uri

Michal Nicpon

12/12/2022, 12:39 PM

One more thing before I take this back to the team. Can you do a

select * from hosts where osquery_host_id = "FD723F42-2034-8947-200D-9D1902CF7058"

Gregory Storme

12/12/2022, 12:40 PM

                  id: 243411
     osquery_host_id: FD723F42-2034-8947-200D-9D1902CF7058
          created_at: 2022-12-12 13:12:56
          updated_at: 2022-12-12 13:12:56
   detail_updated_at: 1970-01-02 01:00:00
            node_key: +hDXpX6Z+97AtEdRIL9X1o7W3iiUOfvv
            hostname:
                uuid:
            platform:
     osquery_version:
          os_version:
               build:
       platform_like:
           code_name:
              uptime: 0
              memory: 0
            cpu_type:
         cpu_subtype:
           cpu_brand:
  cpu_physical_cores: 0
   cpu_logical_cores: 0
     hardware_vendor:
      hardware_model:
    hardware_version:
     hardware_serial:
       computer_name:
       primary_ip_id: NULL
distributed_interval: 0
   logger_tls_period: 0
  config_tls_refresh: 0
          primary_ip:
         primary_mac:
    label_updated_at: 1970-01-02 01:00:00
    last_enrolled_at: 1970-01-02 01:00:00
   refetch_requested: 1
             team_id: NULL
   policy_updated_at: 1970-01-02 01:00:00
           public_ip:
      orbit_node_key: +hDXpX6Z+97AtEdRIL9X1o7W3iiUOfvv

Michal Nicpon

12/12/2022, 12:42 PM

based on the ids and the created_at timestamps, it seems like there are quite a few hosts being created/deleted over a relatively short period of time. Are you experiencing any other issues, e.g. degraded performance?

Gregory Storme

12/12/2022, 12:45 PM

no other issues. in the last hour, the count for such hosts went from 1724 to 1725.

Gregory Storme

12/12/2022, 12:46 PM

we have 2551 real hosts in fleet
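
A minimal sketch of how the two counts above could be tracked, assuming the placeholder rows are exactly the ones whose detail_updated_at still holds the 1970 default. It uses an in-memory SQLite table as a stand-in for Fleet's MySQL hosts table; the sample rows and the helper name count_hosts are made up for illustration.

```python
import sqlite3

# Stand-in for Fleet's MySQL `hosts` table, reduced to the columns needed here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hosts ("
    "id INTEGER PRIMARY KEY, osquery_host_id TEXT, hostname TEXT, detail_updated_at TEXT)"
)
# Made-up sample rows: one normal host and two placeholder rows.
conn.executemany(
    "INSERT INTO hosts VALUES (?, ?, ?, ?)",
    [
        (1, "host-id-a", "web01", "2022-12-12 11:00:00"),
        (2, "host-id-b", "", "1970-01-02 01:00:00"),
        (3, "host-id-c", "", "1970-01-02 01:00:00"),
    ],
)


def count_hosts(conn):
    """Return (placeholder_count, real_count), split on the 1970 default timestamp."""
    placeholder = conn.execute(
        "SELECT COUNT(*) FROM hosts WHERE detail_updated_at < '1971-01-01'"
    ).fetchone()[0]
    total = conn.execute("SELECT COUNT(*) FROM hosts").fetchone()[0]
    return placeholder, total - placeholder


placeholder_count, real_count = count_hosts(conn)
print(placeholder_count, real_count)
```

Running the same count at intervals would show whether the number of placeholder rows is growing or merely being recycled under new ids.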

Michal Nicpon

12/12/2022, 12:48 PM

This may be a little hard for you to find out, but are you able to identify which hosts are having issues? orbit/osquery logs from one of these hosts may be helpful

Gregory Storme

12/12/2022, 1:10 PM

looks like all of these hosts are being deleted and added every 2 hours

Michal Nicpon

12/12/2022, 1:11 PM

yes, we have a cleanup job that periodically cleans up hosts. This explains why the same hosts appear to be recreated, but with new ids

Gregory Storme

12/12/2022, 1:20 PM

ok. I don't have a way to check the logs on those 2551 hosts, we don't have centralized logging for the osquery/orbit logs

Gregory Storme

12/12/2022, 1:21 PM

when deleting these hosts, they are instantly being recreated

Michal Nicpon

12/12/2022, 2:47 PM

okay, we will look into this further and get back to you later

Michal Nicpon

12/13/2022, 9:06 AM

ok. I don’t have a way to check the logs on those 2551 hosts, we don’t have centralized logging for the osquery/orbit logs

Do you mean you don’t have any access at all to these hosts? The logs for even one of the affected hosts would really help here. See docs for where to find orbit logs on various platforms.

Gregory Storme

12/13/2022, 9:09 AM

I have access to all hosts, but I don't know how to identify an affected host

Michal Nicpon

12/13/2022, 9:34 AM

Okay, there’s a couple ways I can think of, but they may require some trial and error.

1. Run osqueryi on a host and execute the following sql query:

select uuid from osquery_info

See if it matches the osquery_host_id of an affected host in the fleet db.

2. Run the following command (osx/linux):

ps -eo pid,lstart,command | grep osquery

and see if osquery is either not running, or has a start time that is very recent, i.e. less than 1 hour ago.

Gregory Storme

12/13/2022, 10:25 AM

I ran

osqueryd -S --json "select uuid from osquery_info"

on all of our windows hosts, and 612 uuid's from that query match with an osquery_host_id from those offline hosts in the fleet db
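
The matching step described here can be sketched roughly as follows, assuming the per-host uuid values have already been collected into a mapping and the osquery_host_id values of the placeholder rows have been exported from the fleet db. The function name, the hostnames, and the case-insensitive comparison are assumptions for illustration, not something confirmed in this thread.

```python
def match_placeholder_hosts(local_uuids, placeholder_host_ids):
    """Return {hostname: uuid} for hosts whose uuid equals a placeholder osquery_host_id.

    Comparison is case-insensitive (an assumption) and ignores surrounding whitespace.
    """
    wanted = {host_id.strip().upper() for host_id in placeholder_host_ids}
    return {
        hostname: uuid
        for hostname, uuid in local_uuids.items()
        if uuid.strip().upper() in wanted
    }


# uuid values as reported by `select uuid from osquery_info` (hostnames are made up).
local_uuids = {
    "web01": "59043F42-0ECC-2981-B532-AD4EEA9D2814",
    "db01": "00000000-0000-0000-0000-000000000001",
}
# osquery_host_id values taken from placeholder rows in the hosts table.
placeholder_host_ids = [
    "59043f42-0ecc-2981-b532-ad4eea9d2814",
    "FD723F42-2034-8947-200D-9D1902CF7058",
]

affected = match_placeholder_hosts(local_uuids, placeholder_host_ids)
print(sorted(affected))
```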

Gregory Storme

12/13/2022, 10:35 AM

orbit was auto updated on these hosts on 2022-12-06 and I notice this in the logs, could this be related?

INF initial flags update failed error="error getting flags from fleet: unauthenticated, or invalid token"

Michal Nicpon

12/13/2022, 12:48 PM

seems likely, looking into it

Michal Nicpon

12/13/2022, 12:48 PM

did you also check if osquery was running using 2 above?

Michal Nicpon

12/13/2022, 12:50 PM

on one of the hosts you found that was affected

Gregory Storme

12/13/2022, 12:54 PM

so far i've only checked the windows hosts, and the service was running since Dec 6th on most of them. need to check for linux still

Michal Nicpon

12/13/2022, 12:58 PM

On windows you can run the following in powershell

Get-Process osqueryd | select name,starttime

Michal Nicpon

12/13/2022, 12:59 PM

Just confirming that osqueryd is not running, and that the issue is limited to orbit

Michal Nicpon

12/13/2022, 1:03 PM

Do you see any errors in the fleet logs containing /api/fleet/orbit/config?

Gregory Storme

12/13/2022, 1:05 PM

Both osqueryd & orbit processes are running, but this is also the case on non-problem hosts with a uuid that does not appear in the list of problematic fleetdb uuid's

Gregory Storme

12/13/2022, 1:05 PM

yes, lots of these:

fleet[2878872]: {"component":"http","err":": Authentication required","internal":"authentication error: invalid orbit node key","level":"info","path":"/api/fleet/orbit/config","ts":"2022-12-13T13:05:18.614380037Z"}
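
A rough sketch of how such log lines could be tallied per endpoint, assuming each line carries a syslog-style prefix followed by one JSON object, as in the line quoted above. The function name and the second and third sample lines are made up; only the first sample is modelled on the real log entry.

```python
import json
from collections import Counter


def count_auth_errors(lines, path_prefix="/api/fleet/orbit/"):
    """Count 'invalid orbit node key' errors per request path."""
    counts = Counter()
    for line in lines:
        start = line.find("{")  # skip the "fleet[pid]: " prefix
        if start == -1:
            continue
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # not a JSON log line
        if not isinstance(entry, dict):
            continue
        path = str(entry.get("path", ""))
        internal = str(entry.get("internal", ""))
        if path.startswith(path_prefix) and "invalid orbit node key" in internal:
            counts[path] += 1
    return counts


sample_lines = [
    'fleet[2878872]: {"component":"http","err":": Authentication required",'
    '"internal":"authentication error: invalid orbit node key",'
    '"level":"info","path":"/api/fleet/orbit/config"}',
    'fleet[2878872]: {"level":"info","path":"/api/v1/osquery/config"}',  # made up
    "a plain text line without JSON",  # made up
]

print(count_auth_errors(sample_lines))
```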

Gregory Storme

12/13/2022, 1:08 PM

these logs match when there's an http 401, I've asked about this 2 months ago here: https://osquery.slack.com/archives/C01DXJL16D8/p1666699093295139

Michal Nicpon

12/13/2022, 1:28 PM

These two issues might be related.

Michal Nicpon

12/13/2022, 1:29 PM

Can you check if

C:\Program Files\Orbit\secret-orbit-node-key.txt

exists and is not empty?

Gregory Storme

12/13/2022, 1:35 PM

yes, exists and not empty

Michal Nicpon

12/13/2022, 1:44 PM

Can you verify that it’s valid using the following sql query in fleet’s db

select id from hosts where orbit_node_key = "secret"

Replacing secret with the value from the above file

Gregory Storme

12/13/2022, 1:47 PM

and do this for all hosts with a matching uuid in the previous step probably?

Michal Nicpon

12/13/2022, 1:51 PM

let’s limit it to 1 or 2 hosts for now

Gregory Storme

12/13/2022, 1:54 PM

When I run that orbit_node_key query, I get one of those problem hosts

Gregory Storme

12/13/2022, 1:55 PM

And the orbit_node_key for that real host is NULL in the fleetdb

Michal Nicpon

12/13/2022, 1:57 PM

can you run

select * from hosts

for the problem host and the real host and paste the output?

Gregory Storme

12/13/2022, 1:59 PM

*************************** 1. row ***************************
                  id: 2088
     osquery_host_id: 89f17842-22be-4b0e-98ac-67efc907a9ba
          created_at: 2022-03-21 09:55:53
          updated_at: 2022-12-13 14:56:24
   detail_updated_at: 2022-12-13 14:56:24
            node_key: 4gEbEsUgRIX5yTYrQ7hLhplDbdpOT87U
            hostname: MASKED-web01
                uuid: 59043F42-0ECC-2981-B532-AD4EEA9D2814
            platform: windows
     osquery_version: 5.6.0
          os_version: Windows Server 2012 R2 Standard
               build: 9600
       platform_like: windows
           code_name: Microsoft Windows Server 2012 R2 Standard
              uptime: 2896149000000000
              memory: 4294967296
            cpu_type: x86_64
         cpu_subtype: -1
           cpu_brand: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
  cpu_physical_cores: 2
   cpu_logical_cores: 2
     hardware_vendor: VMware, Inc.
      hardware_model: VMware Virtual Platform
    hardware_version: -1
     hardware_serial: VMware-42 3f 04 59 cc 0e 81 29-b5 32 ad 4e ea 9d 28 14
       computer_name: MASKED-WEB01
       primary_ip_id: NULL
distributed_interval: 10
   logger_tls_period: 10
  config_tls_refresh: 60
          primary_ip: MASKED
         primary_mac: 00:50:56:bf:b6:20
    label_updated_at: 2022-12-13 14:55:34
    last_enrolled_at: 2022-03-21 09:55:53
   refetch_requested: 0
             team_id: NULL
   policy_updated_at: 2022-12-13 05:40:55
           public_ip: MASKED
      orbit_node_key: NULL


*************************** 1. row ***************************
                  id: 290414
     osquery_host_id: 59043F42-0ECC-2981-B532-AD4EEA9D2814
          created_at: 2022-12-13 14:08:47
          updated_at: 2022-12-13 14:08:47
   detail_updated_at: 1970-01-02 01:00:00
            node_key: 0onOi8pacBhK/lxzUd1NKLRd4JBP02Kf
            hostname:
                uuid:
            platform:
     osquery_version:
          os_version:
               build:
       platform_like:
           code_name:
              uptime: 0
              memory: 0
            cpu_type:
         cpu_subtype:
           cpu_brand:
  cpu_physical_cores: 0
   cpu_logical_cores: 0
     hardware_vendor:
      hardware_model:
    hardware_version:
     hardware_serial:
       computer_name:
       primary_ip_id: NULL
distributed_interval: 0
   logger_tls_period: 0
  config_tls_refresh: 0
          primary_ip:
         primary_mac:
    label_updated_at: 1970-01-02 01:00:00
    last_enrolled_at: 1970-01-02 01:00:00
   refetch_requested: 1
             team_id: NULL
   policy_updated_at: 1970-01-02 01:00:00
           public_ip:
      orbit_node_key: 0onOi8pacBhK/lxzUd1NKLRd4JBP02Kf

Michal Nicpon

12/13/2022, 2:04 PM

And the orbit_node_key for that real host is NULL in the fleetdb

just to clarify, is the above output from the same host? How do you know it’s the same host?

Gregory Storme

12/13/2022, 2:08 PM

Because the orbit_node_key from the host with id 290414 matches the secret in the secret-orbit-node-key.txt file, but all the other host details match with host id 2088
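
The pairing described here (a placeholder row whose osquery_host_id equals the uuid of a fully populated row that has no orbit_node_key) can be expressed as a self-join. The sketch below runs it against an in-memory SQLite table as a stand-in for Fleet's MySQL hosts table; the reduced column set, the empty-string uuid on placeholder rows, the key strings, and the third "healthy" row are assumptions for illustration.

```python
import sqlite3

PAIR_QUERY = """
SELECT ph.id AS placeholder_id, rh.id AS real_id, rh.hostname
FROM hosts AS ph
JOIN hosts AS rh
  ON UPPER(ph.osquery_host_id) = UPPER(rh.uuid)
WHERE ph.id <> rh.id
  AND ph.uuid = ''                  -- placeholder rows have no host details
  AND ph.orbit_node_key IS NOT NULL -- but they hold the orbit node key
  AND rh.orbit_node_key IS NULL     -- while the real row has none
ORDER BY ph.id
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hosts (id INTEGER PRIMARY KEY, osquery_host_id TEXT, "
    "uuid TEXT, hostname TEXT, orbit_node_key TEXT)"
)
conn.executemany(
    "INSERT INTO hosts VALUES (?, ?, ?, ?, ?)",
    [
        # Real host and placeholder row, using ids and identifiers from the thread.
        (2088, "89f17842-22be-4b0e-98ac-67efc907a9ba",
         "59043F42-0ECC-2981-B532-AD4EEA9D2814", "MASKED-web01", None),
        (290414, "59043F42-0ECC-2981-B532-AD4EEA9D2814", "", "", "example-key-1"),
        # Hypothetical healthy host whose identifiers agree.
        (1, "AAAAAAAA-0000-0000-0000-000000000001",
         "AAAAAAAA-0000-0000-0000-000000000001", "healthy-host", "example-key-2"),
    ],
)

pairs = conn.execute(PAIR_QUERY).fetchall()
print(pairs)
```

Each returned tuple names a placeholder id, the real host id it shadows, and that host's hostname, which gives a list of candidate hosts to inspect.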

Michal Nicpon

12/13/2022, 2:15 PM

Did you by any chance change (i.e. add or remove) any enroll secrets in fleet?

Gregory Storme

12/13/2022, 7:46 PM

no, only 1 enroll secret and it hasn't changed since we deployed fleet

Raghavendra Hiremath

12/14/2022, 6:01 AM

Hello all, once I install the osquery agent on node machines, I see this log message and can't see the RAM data: Error system_info.cpp:73] Got error trying to determine the physically installed memory: SMBIOS data is malformed

Michal Nicpon

12/14/2022, 10:32 AM

@Raghavendra Hiremath is this related to the issue being discussed in this thread? If not, would you mind messaging in the #fleet channel?

Michal Nicpon

12/14/2022, 10:46 AM

Can you run the following osquery query on the affected host

select * from osquery_flags where name = 'host_identifier'

Gregory Storme

12/14/2022, 10:50 AM

MASKED-WEB01,"hostname","Field used to identify the host running osquery (hostname, uuid, instance, ephemeral, specified)","host_identifier","0","string","uuid"

Michal Nicpon

12/14/2022, 10:59 AM

One thing I noticed was that the osquery_host_id and the uuid for the real host don’t match. Trying to figure out why that is.

Michal Nicpon

12/14/2022, 11:10 AM

Can you go to the Fleet UI -> Settings -> Organization Settings -> Agent Options and see if there is anything set for command_line_flags?

Gregory Storme

12/14/2022, 11:11 AM

no we haven't used this

command_line_flags: {} # requires Fleet's osquery installer

Michal Nicpon

12/14/2022, 11:19 AM

do you mind pasting the whole yaml actually

Gregory Storme

12/14/2022, 11:20 AM

config:
  options:
    logger_plugin: tls
    disable_carver: true
    disable_tables: 'chrome_extensions,firefox_addons'
    logger_tls_period: 10
    distributed_plugin: tls
    disable_distributed: false
    logger_tls_endpoint: /api/osquery/log
    distributed_interval: 10
    carver_disable_function: true
    distributed_tls_max_attempts: 3
command_line_flags: {} # requires Fleet's osquery installer

Michal Nicpon

12/14/2022, 11:47 AM

On the affected host, can you run the following in powershell (may require admin)

Get-CimInstance Win32_process -Filter "name ='osqueryd.exe'" | select CommandLine

Gregory Storme

12/14/2022, 11:53 AM

CommandLine : "C:\Program Files\Orbit\bin\osqueryd\windows\stable\osqueryd.exe" "--pidfile=C:\Program Files\Orbit\osque
              ry.pid" "--database_path=C:\Program Files\Orbit\osquery.db" --extensions_socket=\\.\pipe\orbit-osquery-ex
              tension "--logger_path=C:\Program Files\Orbit\osquery_log" --enroll_secret_env ENROLL_SECRET --host_ident
              ifier=uuid --tls_hostname=fleet.x-ops.net --enroll_tls_endpoint=/api/v1/osquery/enroll --config_plu
              gin=tls --config_tls_endpoint=/api/v1/osquery/config --config_refresh=60 --disable_distributed=false --di
              stributed_plugin=tls --distributed_tls_max_attempts=10 --distributed_tls_read_endpoint=/api/v1/osquery/di
              stributed/read --distributed_tls_write_endpoint=/api/v1/osquery/distributed/write --logger_plugin=tls,fil
              esystem --logger_tls_endpoint=/api/v1/osquery/log --disable_carver=false --carver_disable_function=false
              --carver_start_endpoint=/api/v1/osquery/carve/begin --carver_continue_endpoint=/api/v1/osquery/carve/bloc
              k --carver_block_size=2000000 --tls_server_certs "C:\Program Files\Orbit\certs.pem" --force --flagfile "C
              :\Program Files\Orbit\osquery.flags"

CommandLine : "C:\Program Files\Orbit\bin\osqueryd\windows\stable\osqueryd.exe" "--pidfile=C:\Program Files\Orbit\osque
              ry.pid" "--database_path=C:\Program Files\Orbit\osquery.db" --extensions_socket=\\.\pipe\orbit-osquery-ex
              tension "--logger_path=C:\Program Files\Orbit\osquery_log" --enroll_secret_env ENROLL_SECRET --host_ident
              ifier=uuid --tls_hostname=fleet.x-ops.net --enroll_tls_endpoint=/api/v1/osquery/enroll --config_plu
              gin=tls --config_tls_endpoint=/api/v1/osquery/config --config_refresh=60 --disable_distributed=false --di
              stributed_plugin=tls --distributed_tls_max_attempts=10 --distributed_tls_read_endpoint=/api/v1/osquery/di
              stributed/read --distributed_tls_write_endpoint=/api/v1/osquery/distributed/write --logger_plugin=tls,fil
              esystem --logger_tls_endpoint=/api/v1/osquery/log --disable_carver=false --carver_disable_function=false
              --carver_start_endpoint=/api/v1/osquery/carve/begin --carver_continue_endpoint=/api/v1/osquery/carve/bloc
              k --carver_block_size=2000000 --tls_server_certs "C:\Program Files\Orbit\certs.pem" --force --flagfile "C
              :\Program Files\Orbit\osquery.flags"

Michal Nicpon

12/14/2022, 11:57 AM

What are the contents of

C:\Program Files\Orbit\osquery.flags

Gregory Storme

12/14/2022, 11:58 AM

empty file

Michal Nicpon

12/14/2022, 12:09 PM

Okay, I think I have something we can try. Delete the host with id

using the Fleet UI. This should trigger osquery running on the host to do a reenrollment. I still haven’t figured out why there is a discrepancy with the osquery host id, but this should hopefully resolve the issue. We can confirm it’s resolved by waiting ~1 hr and running the following query on the fleet db

select * from hosts where osquery_host_id = '59043F42-0ECC-2981-B532-AD4EEA9D2814' or uuid = '59043F42-0ECC-2981-B532-AD4EEA9D2814'

We should only get 1 row.

Michal Nicpon

12/14/2022, 12:12 PM

If this works, we may need to do the same for the rest of the affected hosts. I will likely need to create a github issue to follow up. There may be a bug with orbit enrollment and osquery enrollment

Gregory Storme

12/15/2022, 1:16 PM

Yes, this works. Only 1 host was enrolled after deleting it

Michal Nicpon

12/15/2022, 1:21 PM

are you using orbit on all of your hosts? I’m trying to think of a way that we can do the same for all the affected hosts easily.

Gregory Storme

12/15/2022, 1:22 PM

Yes, using Orbit with the generated deb / msi package

Michal Nicpon

12/15/2022, 1:31 PM

okay, give me a few minutes to come up with a sql query to delete the other hosts and related data. Can you confirm what version of fleet you are on?

Gregory Storme

12/15/2022, 1:35 PM

fleet_v4.24.1_linux

Michal Nicpon

12/15/2022, 1:40 PM

are you using fleet premium, i.e. are you using teams?

Gregory Storme

12/15/2022, 1:41 PM

Michal Nicpon

12/15/2022, 3:58 PM

okay, I wrote an sql script that should delete the remaining affected hosts. This assumes that all hosts are using orbit, and that you aren’t using teams. I would advise making a backup of your db and stopping fleet before running it. After starting fleet again, all affected hosts will reenroll.

cleanup_hosts.sql

Michal Nicpon

12/15/2022, 4:06 PM

I also created a github issue https://github.com/fleetdm/fleet/issues/9033. The repro steps are probably different than the way this bug happened for you, but I suspect the root cause is similar. If this problem happens again, please feel free to comment on the issue, or mention it in slack.

Gregory Storme

12/19/2022, 5:18 PM

looks good, thanks. still about 10 such hosts in the database that re-appear after being deleted, and also since the query there are about 20 hosts that are online but can't be fetched anymore. I'll look at fixing these tomorrow
