We’re seeing an issue where new hosts are unable t...
# core
c
We’re seeing an issue where new hosts are unable to register with our tls_server (Fleet) upon initial startup, but once we stop osqueryd, run an
osqueryctl clean
, and start it up again, it registers fine. So something about the rocksdb is jamming up the tls server enrollment process.
Here's the logs when I start up a new instance:
Copy code
-- Logs begin at Fri 2021-02-12 16:45:58 UTC. --
Feb 12 16:47:02 ip-10-254-128-197 systemd[1]: Starting The osquery Daemon...
Feb 12 16:47:02 ip-10-254-128-197 systemd[1]: Started The osquery Daemon.
Feb 12 16:47:03 ip-10-254-128-197 osqueryd[668]: osqueryd started [version=4.3.0]
Feb 12 16:47:04 ip-10-254-128-197 osqueryd[668]: I0212 16:47:04.333811   711 events.cpp:863] Event publisher not enabled: auditeventpublisher: Publisher disabled via configuration
Feb 12 16:47:04 ip-10-254-128-197 osqueryd[668]: I0212 16:47:04.333986   711 events.cpp:863] Event publisher not enabled: syslog: Publisher disabled via configuration
Feb 12 16:47:04 ip-10-254-128-197 osqueryd[668]: I0212 16:47:04.730011   737 interface.cpp:108] Registering extension (braintree, 13617, version=, sdk=)
Nothing after that. When I turn on verbose logging and restart:
Copy code
Feb 12 17:02:57 creisor-test2 osqueryd[19315]: osqueryd started [version=4.3.0]
Feb 12 17:02:57 creisor-test2 osqueryd[19315]: I0212 17:02:57.899641 19315 virtual_table.cpp:1069] Table arp_cache is disabled, not attaching
Feb 12 17:02:57 creisor-test2 osqueryd[19315]: I0212 17:02:57.899675 19315 virtual_table.cpp:1069] Table atom_packages is disabled, not attaching
Feb 12 17:02:57 creisor-test2 osqueryd[19315]: I0212 17:02:57.899696 19315 virtual_table.cpp:1069] Table augeas is disabled, not attaching
... a bunch more disabled tables...
Copy code
eb 12 17:02:57 creisor-test2 osqueryd[19315]: I0212 17:02:57.912704 19315 system.cpp:335] Found stale process for osqueryd (668)
Feb 12 17:02:57 creisor-test2 osqueryd[19315]: I0212 17:02:57.912752 19315 system.cpp:367] Writing osqueryd pid (19315) to /var/run/osqueryd.pidfile
Feb 12 17:02:57 creisor-test2 osqueryd[19315]: I0212 17:02:57.912873 19315 extensions.cpp:434] Found autoloadable extension: /usr/lib/osquery/extensions/bt_osquery.ext
Feb 12 17:02:57 creisor-test2 osqueryd[19315]: I0212 17:02:57.912947 19315 dispatcher.cpp:77] Adding new service: WatcherRunner (0x3742008) to thread: 140661741872896 (0x3795370) in proce
ss 19315
Feb 12 17:02:57 creisor-test2 osqueryd[19315]: I0212 17:02:57.913702 19331 watcher.cpp:583] osqueryd watcher (19315) executing worker (19332)
Feb 12 17:02:57 creisor-test2 osqueryd[19315]: I0212 17:02:57.914024 19331 watcher.cpp:629] Created and monitoring extension child (19333): /usr/lib/osquery/extensions/bt_osquery.ext
Feb 12 17:02:57 creisor-test2 osqueryd[19315]: I0212 17:02:57.924050 19332 init.cpp:340] osquery worker initialized [watcher=19315]
Feb 12 17:02:57 creisor-test2 osqueryd[19315]: I0212 17:02:57.924191 19332 dispatcher.cpp:77] Adding new service: WatcherWatcherRunner (0x3ac1168) to thread: 140707453564672 (0x3ac19a0) i
n process 19332
Feb 12 17:02:57 creisor-test2 osqueryd[19315]: I0212 17:02:57.924352 19332 rocksdb.cpp:131] Opening RocksDB handle: /var/osquery/osquery.db
Feb 12 17:02:57 creisor-test2 osqueryd[19315]: I0212 17:02:57.943020 19332 dispatcher.cpp:77] Adding new service: ExtensionWatcher (0x3ac91a8) to thread: 140707213862656 (0x3bcbb50) in pr
ocess 19332
Feb 12 17:02:57 creisor-test2 osqueryd[19315]: I0212 17:02:57.943116 19332 dispatcher.cpp:77] Adding new service: ExtensionRunnerCore (0x3bcc158) to thread: 140707205469952 (0x3ac92b0) in
 process 19332
Feb 12 17:02:57 creisor-test2 osqueryd[19315]: I0212 17:02:57.943172 19332 auto_constructed_tables.cpp:96] Removing stale ATC entries
Feb 12 17:02:57 creisor-test2 osqueryd[19315]: I0212 17:02:57.943359 19349 interface.cpp:268] Extension manager service starting: /var/osquery/osquery.em
Feb 12 17:02:57 creisor-test2 osqueryd[19315]: I0212 17:02:57.943507 19332 dispatcher.cpp:77] Adding new service: ConfigRefreshRunner (0x3bceee8) to thread: 140707197077248 (0x3bcc8b0) in
 process 19332
Feb 12 17:02:57 creisor-test2 osqueryd[19315]: I0212 17:02:57.943792 19332 tls.cpp:253] TLS/HTTPS POST request to URI: <https://redacted/api/v1/osquery/config>
Feb 12 17:02:58 creisor-test2 osqueryd[19315]: I0212 17:02:58.031823 19332 system.cpp:294] Using host identifier: b2627174-59cd-4378-832c-cd49de707a64
Feb 12 17:02:58 creisor-test2 osqueryd[19332]: severity=0 location=init.cpp:340 message=osquery worker initialized [watcher=19315]
Feb 12 17:02:58 creisor-test2 osqueryd[19332]: severity=0 location=dispatcher.cpp:77 message=Adding new service: WatcherWatcherRunner (0x3ac1168) to thread: 140707453564672 (0x3ac19a0) in
 process 19332
Feb 12 17:02:58 creisor-test2 osqueryd[19332]: severity=0 location=rocksdb.cpp:131 message=Opening RocksDB handle: /var/osquery/osquery.db
Feb 12 17:02:58 creisor-test2 osqueryd[19332]: severity=0 location=dispatcher.cpp:77 message=Adding new service: ExtensionWatcher (0x3ac91a8) to thread: 140707213862656 (0x3bcbb50) in pro
cess 19332
Feb 12 17:02:58 creisor-test2 osqueryd[19332]: severity=0 location=dispatcher.cpp:77 message=Adding new service: ExtensionRunnerCore (0x3bcc158) to thread: 140707205469952 (0x3ac92b0) in
process 19332
Feb 12 17:02:58 creisor-test2 osqueryd[19332]: severity=0 location=auto_constructed_tables.cpp:96 message=Removing stale ATC entries
Feb 12 17:02:58 creisor-test2 osqueryd[19332]: severity=0 location=interface.cpp:268 message=Extension manager service starting: /var/osquery/osquery.em
Feb 12 17:02:58 creisor-test2 osqueryd[19332]: severity=0 location=dispatcher.cpp:77 message=Adding new service: ConfigRefreshRunner (0x3bceee8) to thread: 140707197077248 (0x3bcc8b0) in
process 19332
Feb 12 17:02:58 creisor-test2 osqueryd[19332]: severity=0 location=tls.cpp:253 message=TLS/HTTPS POST request to URI: <https://redacted/api/v1/osquery/config>
Feb 12 17:02:58 creisor-test2 osqueryd[19315]: I0212 17:02:58.119488 19354 interface.cpp:108] Registering extension (braintree, 16168, version=, sdk=)
Feb 12 17:02:58 creisor-test2 osqueryd[19315]: I0212 17:02:58.119809 19354 registry_factory.cpp:106] Extension 16168 registered table plugin bt_env
Feb 12 17:02:58 creisor-test2 osqueryd[19315]: I0212 17:02:58.324939 19332 events.cpp:863] Event publisher not enabled: auditeventpublisher: Publisher disabled via configuration
Feb 12 17:02:58 creisor-test2 osqueryd[19315]: I0212 17:02:58.325028 19332 events.cpp:863] Event publisher not enabled: syslog: Publisher disabled via configuration
Feb 12 17:02:58 creisor-test2 osqueryd[19315]: I0212 17:02:58.325161 19332 events.cpp:1122] Error registering subscriber: process_file_events: Subscriber disabled via configuration
Feb 12 17:02:58 creisor-test2 osqueryd[19315]: I0212 17:02:58.325189 19332 events.cpp:1122] Error registering subscriber: selinux_events: Subscriber disabled via configuration
Feb 12 17:02:58 creisor-test2 osqueryd[19315]: I0212 17:02:58.325208 19332 events.cpp:1122] Error registering subscriber: socket_events: Subscriber disabled via configuration
Feb 12 17:02:58 creisor-test2 osqueryd[19315]: I0212 17:02:58.325446 19355 events.cpp:784] Starting event publisher run loop: inotify
Feb 12 17:02:58 creisor-test2 osqueryd[19315]: I0212 17:02:58.325469 19356 events.cpp:784] Starting event publisher run loop: udev
Feb 12 17:02:58 creisor-test2 osqueryd[19315]: I0212 17:02:58.325531 19332 dispatcher.cpp:77] Adding new service: DistributedRunner (0x3ded048) to thread: 140707071252224 (0x3e03320) in process 19332
Feb 12 17:02:58 creisor-test2 osqueryd[19315]: I0212 17:02:58.325594 19332 dispatcher.cpp:77] Adding new service: SchedulerRunner (0x3e313f8) to thread: 140707062859520 (0x3cbb830) in process 19332
Feb 12 17:02:58 creisor-test2 osqueryd[19315]: I0212 17:02:58.325811 19357 tls.cpp:253] TLS/HTTPS POST request to URI: <https://redacted/api/v1/osquery/distributed/read>
That POST request will continue to happen, but the instance remains un-registered.
One key difference between the verbose logs before I delete the db file and after is this line:
Copy code
Feb 12 17:17:36 creisor-test2 osqueryd[24471]: I0212 17:17:36.620189 24492 tls_enroll.cpp:69] TLSEnrollPlugin requesting a node enroll key from: <https://redacted/api/v1/osquery/enroll>
t
@Chris Reisor do these new hosts have an existing
/var/osquery/osquery.db
? They should not but perhaps some deployment software is copying over an existing DB from a golden image (or something like that)?
The enroll step typically happens once and caches an enroll key in the local DB. Enrollment is skills (unless explicitly always required) and the key exists locally.
I am not dismissing the potential of a bug, but I want to understand the scenario a bit more.
c
I'm not ready to assume a bug either yet πŸ™‚ Oddly enough I just now happened to be spinning up an instance. The db file seems to have been created after the service started up the first time. I have two theories I can try first.
the cached enroll key is a good bit of info I didn't know
t
This might be helpful for debugging. If you start the service for the first time, then shut it down (confirm no
osqueryd
process is still running) You can run:
sudo osqueryi --disable_extensions --database_path /var/osquery/osquery.db --database_dump | grep nodeKey
to check if a
nodeKey
was provided by the Fleet service.
Keep in mind that the persistent storage (the osquery.db file) does not allow multiple processes to open/read concurrently.
c
that's great, thank you! that'll keep me busy the rest of the afternoon πŸ™‚
Bingo! Because osqueryd was being enabled in our AMI configuration, it was starting up and registering. It caches that registration in the DB file, and then an instance is spun up from that AMI with the DB file already existing, but with an instance id that doesn't match anything on the tls server side.
Should be an easy fix: install osqueryd but inactive at AMI creation time, enable the daemon at instance creation time by adding it to the user-data script. (This is all AWS talk in case it's unclear)
Thanks so much
t
hehe yeah, a bit over my head πŸ˜›
but LMK if the workflow is too complex and I can see if there are osquery "tricks" that will make it easier
c
in a nutshell, your first guess was correct πŸ™‚
πŸŽ‰ 1
t
--*enroll*_always
is a flag, it does what it says, but this may mess up Fleet
(as an example of a potential "trick")
c
I see. I think the proper fix is to avoid starting the daemon in the image; only at runtime
t
yeap