Hey all, I'm getting an issue where I'm getting up...
# general
a
Hey all, I'm getting an issue where I'm getting upwards of 50% CPU usage by the osquery daemon on windows domain controllers. running the daemon in verbose, it happens right after the following messages (and hangs there while the CPU spikes):
Copy code
I0420 11:22:13.001909  6160 eventsubscriberplugin.cpp:492] Found 15 events for subscriber WindowsEventLogPublisher.powershell_events
I0420 11:22:13.533206  6160 events.cpp:70] Skipping subscriber: powershell_events: Required publisher is disabled by configuration
After which it gets stopped by the watchdog, and the whole thing repeats a few seconds later. Any insights?
f
@Ari Weinberg what's your query config look like?
I know i've seen machines grind to a halt when querying seemingly innocuous tables (eg.
users
) which can have thousands of rows when the device in question is a domain controller
a
I had that problem already, and disabled all those queries
f
👍 well at least i wasn't totally off the mark
a
Heres the current flags file:
Copy code
# Configuration
--config_plugin=tls
--config_tls_endpoint=/api/v1/osquery/config
--config_refresh=60

# Enable watchdog to stop queries that use too much resources
--disable_watchdog=false
--watchdog_level=1

# Live query
--disable_distributed=false
--distributed_plugin=tls
--distributed_interval=10
--distributed_tls_max_attempts=3
--distributed_tls_read_endpoint=/api/v1/osquery/distributed/read
--distributed_tls_write_endpoint=/api/v1/osquery/distributed/write

# Logging
--logger_plugin=tls,filesystem
--logger_tls_endpoint=/api/v1/osquery/log
--logger_tls_period=10

# File carving
--disable_carver=false
--carver_start_endpoint=/api/v1/osquery/carve/begin
--carver_continue_endpoint=/api/v1/osquery/carve/block
--carver_block_size=2000000

# Windows Configuration
#--enable_powershell_events_subscriber
#--enable_windows_events_publisher
#--enable_windows_events_subscriber
#--windows_event_channels=System,Application,Setup,Security

# Disable problamatic table
--disable_tables=chrome_extensions,google_chrome_profiles,mdm,munki_info
f
i wonder if you can slowly 50:50 your way through your scheduled query config to see where it is hanging
a
its getting the config from fleet:
Copy code
config:
  options:
    disable_tables: 'munki_info,google_chrome_profiles,mdm'
    pack_delimiter: _
    logger_tls_period: 10
    distributed_plugin: tls
    disable_distributed: false
    logger_tls_endpoint: /api/v1/osquery/log
    distributed_interval: 10
    distributed_tls_max_attempts: 3
f
the other option is run the queries as distributed or via osqueryi one by one and see which are not responding
a
But looking at the daemon in verbose mode, it doesnt look like its happening by a query
f
do we know that the last message is related to the cause of the hang, or is it the last thing that is working successfully pre-hang
a
Here's the full loop that's happening:
Copy code
I0420 12:09:09.901724  6804 eventsubscriberplugin.cpp:492] Found 15 events for subscriber WindowsEventLogPublisher.powershell_events
I0420 12:09:10.429062  6804 events.cpp:70] Skipping subscriber: powershell_events: Required publisher is disabled by configuration
I0420 12:10:04.662425  6288 config.cpp:1218] Refreshing configuration state
I0420 12:10:04.662425  6288 tls.cpp:255] TLS/HTTPS POST request to URI: <https://FLEET_URL/api/v1/osquery/config>
W0420 12:10:05.193722  6532 watcher.cpp:391] osqueryd worker (6908) stopping: Maximum sustainable CPU utilization limit exceeded: 57
I0420 12:10:05.224970  6532 watcher.cpp:656] osqueryd watcher (4528) executing worker (6712)
I0420 12:10:05.307375  6104 init.cpp:354] osquery worker initialized [watcher=4528]
I0420 12:10:05.307375  6104 dispatcher.cpp:78] Adding new service: WatcherWatcherRunner (00000216390B03F0) to thread: 6152 (000002163908E400) in process 6712
I0420 12:10:05.307375  6104 rocksdb.cpp:132] Opening RocksDB handle: \Program Files\osquery\osquery.db
I0420 12:10:05.635535  6104 dispatcher.cpp:78] Adding new service: ExtensionWatcher (000002163A95DEF0) to thread: 4888 (000002163AE92B40) in process 6712
I0420 12:10:05.635535  6104 dispatcher.cpp:78] Adding new service: ExtensionRunnerCore (000002163A94B260) to thread: 5196 (000002163AE92AE0) in process 6712
I0420 12:10:05.635535  6104 auto_constructed_tables.cpp:99] Removing stale ATC entries
I0420 12:10:05.635535  5196 interface.cpp:299] Extension manager service starting: \\.\pipe\osquery.em
I0420 12:10:05.635535  6104 dispatcher.cpp:78] Adding new service: ConfigRefreshRunner (000002163904C7F0) to thread: 6188 (000002163AF02500) in process 6712
I0420 12:10:05.635535  6104 tls.cpp:255] TLS/HTTPS POST request to URI: <https://FLEET_URL/api/v1/osquery/config>
I0420 12:10:10.446698  6104 eventfactory.cpp:156] Event publisher not enabled: ntfs_event_publisher: NTFS event publisher disabled via configuration
I0420 12:10:11.479414  6104 eventsubscriberplugin.cpp:492] Found 15 events for subscriber WindowsEventLogPublisher.powershell_events
I0420 12:10:11.932579  6104 events.cpp:70] Skipping subscriber: powershell_events: Required publisher is disabled by configuration
f
i am admittedly out of my depth in terms of the cause here and any suggestions would be blind stabs
a
@fritz Really appreciate your help
f
sorry i couldn't be of greater assistance, the folks over in #fleet will no doubt be a good resource though if you are having fleet related trouble
s
Yeah I would maybe check how many events osquery is getting, looking at the
osquery_events
table, and or also at the size of the RocksDB database
normally that's the second possible cause after a query that cases high CPU usage. Either a lot of events in general, so the publisher has to work a lot, and/or the RocksDB database that has a lot of data to compact, expire and so on.
a
I backed up and deleted the .db folder, and that seems to have solved it
🎉 1
Whats the long term solution then to manage the db?
s
Seeing you have some events active, would be useful to know the interval of your queries on evented tables
normally it's better to have a low interval so that the number of keys in the RocksDB database doesn't grow too much. If they do, then there are more keys to expire when the query runs
a
In other words, you want to know how often im querying
events
tables?
s
yeah
a
Im not using the eventing framework...
This is just running on a DC, so there are a lot of logs
s
oh my bad, I think I was a victim of word wrapping in the slim Slack thread, I see know that they are commented out.
a
Oh my bad on your bad, those were uncommented untill this morning, when I started troubleshooting
but this still happened for a while after I commented those out (until i just deleted the db this morning)
s
ah so, another thing to be careful about is that if you're not querying the evented tables but still receiving events, those will continuosly accumulate and then at osquery startup the expiration will start
a
What do you mean by experation?
s
that been said, one thing you could check is the content of the db by starting osquery with your backed up db and pass
--database_dump
events collected by osquery won't stay indefinitely in the database to be queried. There's an expiration that is calculated per table/subscriber and based on the interval of the scheduled queries that are done on them.
also, what version of osquery are you using?
a
5.2.1.0
what was
--database_dump
supposed to show me? I just got an infinite scroll
s
What osquery was likely working on. If there are a lot of events, then it was likely working on expiring them. There also can be a lot of logs to be sent via TLS or another method, but normally that's not cause for high cpu usage.
a
So to be clear, keeping the events disabled should solve this in the long term?
Becasue this all seems to come from the windowseventlogpublisher
s
Well, it depends; it's not said that you can't use events, it depends on the rate of those and if your machine is capable, but most of all I would say the interval of the scheduled queries on those.
a
I dont think Im querying those tables at all, but ill check.
s
If you're not using them, don't enable them. If you're using them, query the evented table often (less than a minute),
a
Do any of the query packs that come with osquery use those tables?
s
Nope
a
Okay then im not using them, and im going to disable them.
👍 1
s
at least not Windows tables. There's one for
hardware_events
which is on Linux, and then the
osquery_events
table which tracks the status of publisher and subscribers but that's it
a
@Stefano Bonicatti Thanks so much for your prompt help and insight!