https://github.com/osquery/osquery logo
Title
p

packetzero

02/08/2019, 2:50 PM
@shed7 I think we have had some machines with this issue. Nice work on diagnosing the workaround. This reinforces my view that we need to have maximum sizes on the rocksdb indexes and data.
s

shed7

02/08/2019, 2:56 PM
Cheers, I spoke too soon though. one of the machines is back up to 100% on 1 core but the worker is getting killed every 16 seconds or so
p

packetzero

02/08/2019, 2:57 PM
have you run sst_dump on the .sst files to see which tables / indexes are not being cleaned up?
s

shed7

02/08/2019, 2:58 PM
have not heard of sst_dump before
p

packetzero

02/08/2019, 2:58 PM
it's not part of the osquery stuff, it's a tool in the rocksdb distro / build
s

shed7

02/08/2019, 2:59 PM
what tables & indexes should be cleaned up when osqueryd stops ?
p

packetzero

02/08/2019, 2:59 PM
I'm thinking that you have an events table being populated, but maybe not queries in your schedule.
so it never gets removed
(the data never gets deleted)
which should happen every 5-minutes or so, and after restart
s

shed7

02/08/2019, 3:01 PM
atm I have these flags:
--audit_force_reconfigure=true
--audit_debug
--logger_min_status=1
--disable_events=false
--logger_plugin=syslog
--host_identifier=hostname
--schedule_splay_percent=10
--disable_audit=false
--audit_allow_config=true
--audit_persist=true
--audit_allow_process_events=true
--events_expiry=120
--events_max=50000
--audit_allow_sockets=true
--verbose
--watchdog_delay=120
--disable_extensions=true
and scheduled queries:
"file_events": {
      "query": "SELECT * from file_events;",
      "removed": false,
      "interval": 30
"socket_events":{
      "query": "SELECT s.action, s.auid, s.family, s.local_address, s.local_port, s.path, s.pid, s.remote_address, s.remote_port, s.success, s.time, p.cmdline, p.cmdline_size, p.parent, p.uid, p.euid FROM socket_events s JOIN process_events p ON p.pid = s.pid WHERE s.action='bind';",
      "removed": false,
      "interval": 60
    },
    "process_events": {
      "query": "SELECT auid, cmdline, ctime, cwd, egid, euid, gid, parent, path, pid, time, uid FROM process_events WHERE path NOT IN ('/bin/date', '/bin/mktemp', '/usr/bin/dirname', '/usr/bin/head', '/bin/uname', '/bin/basename');",
      "removed": false,
      "interval": 30
not actual config just pasted here
the scheduled queries on the event tables are the same across the 50 or so boxes running osqueryd, but there are 6 or 7 which this problem. all same OS.
p

packetzero

02/08/2019, 3:09 PM
no query for process_socket_events ?
I mean socket_events
oh, I see it
IF you have only one query per event table, you can set events_expiry=1 , which means it clears after access
I recommend getting sst_dump and running it on the files to see what's in the DB
s

shed7

02/08/2019, 3:14 PM
just compiling now, thanks for the tips
p

packetzero

02/08/2019, 3:15 PM
np
s

shed7

02/08/2019, 4:43 PM
some old events in there, and 38k lines like this:
data.auditeventpublisher.process_events.0003350252' seq:6482196, type:0 =>
I've added events_expiry=1 and it lasted a few hours, but is now stuck in that same loop. this particular box generates about 450 process_events in a 30 second window
p

packetzero

02/09/2019, 7:43 PM
If you start with a new empty database, how long does it take to get into this high cpu loop?
It seems like the rocksdb indexes are never deleted (but data is), and it means osquery is wasting time processing all the unnecessary indexes.
@shed7 I added a comment in https://github.com/facebook/osquery/issues/4333 . If you can try adding that LOG line, and let me know if it's happening, that would help. It's a tough one to reproduce, as devs who have looked into this issue see that all works fine for them.
u

钢铁侠

02/11/2019, 3:42 AM
@packetzero This is me who called spoock1024 in this issue https://github.com/facebook/osquery/pull/5335
this scenario only happens in my specific machines
have you reproduce this scenario in your machines? @packetzero
even as you said
Indexes are only being deleted if all queries have been completed. But I don't think the query results are posted yet
how can we avoid this?
@packetzero
p

packetzero

02/11/2019, 4:54 AM
I am unable to reproduce at the moment.
u

钢铁侠

02/11/2019, 5:00 AM
This is a huge problem that db is too large,it has an impact on osqueries.
p

packetzero

02/11/2019, 5:15 AM
it certainly is
This is why you should use a vendor's osquery. Vendors are the only ones motivated to fix issues.
BTW, do you see logging lines like this: I0210 23:03:37.108081 69574656 rocksdb.cpp:68] RocksDB: [WARN] [db/column_family.cc:675] [queries] Stalling writes because we have 15 immutable memtables (waiting for flush), max_write_buffer_number is set to 16 rate 16777216
u

钢铁侠

02/11/2019, 5:22 AM
where do you see this logging lines?
p

packetzero

02/11/2019, 5:46 AM
It's an INFO log line, I am seeing it in the terminal
u

钢铁侠

02/11/2019, 5:56 AM
ok,let me check.I will tell you the result later
I just find another strange things that some of my machines have no process_events record but have only socket_event, when I run
curl
. it seems that the machine does not record the process_events. but the status of audit is normal.
I have no info like
RocksDB: [WARN] [db/column_family.cc:675] [queries] Stalling writes because we have 15 immutable memtables .....
s

shed7

02/11/2019, 2:36 PM
me neither
u

钢铁侠

02/12/2019, 5:08 AM
have you solved your problem?
@shed7
s

shed7

02/12/2019, 8:42 AM
Afraid not, I added the line packetzero gave above and recompiled, been using that version for about 12 hours, still the same problem exists but I don't see the DBG log line at all, and osqueryd has been restarted a few times
u

钢铁侠

02/13/2019, 8:13 AM
you can disable watchdog
s

shed7

02/13/2019, 8:32 AM
doesn't help
p

packetzero

02/13/2019, 2:37 PM
I tried in vain to reproduce this on Monday, with no luck.
s

shed7

02/13/2019, 3:24 PM
I'll start afresh on the boxes that are exhibiting this behaviour and keep and eye out. Thanks for your help