<@UAC1D100J> I think we have had some machines wit...
# general
p
@shed7 I think we have had some machines with this issue. Nice work on diagnosing the workaround. This reinforces my view that we need to have maximum sizes on the rocksdb indexes and data.
s
Cheers, I spoke too soon though. one of the machines is back up to 100% on 1 core but the worker is getting killed every 16 seconds or so
p
have you run sst_dump on the .sst files to see which tables / indexes are not being cleaned up?
s
have not heard of sst_dump before
p
it's not part of the osquery stuff, it's a tool in the rocksdb distro / build
s
what tables & indexes should be cleaned up when osqueryd stops ?
p
I'm thinking that you have an events table being populated, but maybe not queries in your schedule.
so it never gets removed
(the data never gets deleted)
which should happen every 5-minutes or so, and after restart
s
atm I have these flags:
Copy code
--audit_force_reconfigure=true
--audit_debug
--logger_min_status=1
--disable_events=false
--logger_plugin=syslog
--host_identifier=hostname
--schedule_splay_percent=10
--disable_audit=false
--audit_allow_config=true
--audit_persist=true
--audit_allow_process_events=true
--events_expiry=120
--events_max=50000
--audit_allow_sockets=true
--verbose
--watchdog_delay=120
--disable_extensions=true
and scheduled queries:
Copy code
"file_events": {
      "query": "SELECT * from file_events;",
      "removed": false,
      "interval": 30
"socket_events":{
      "query": "SELECT s.action, s.auid, s.family, s.local_address, s.local_port, s.path, s.pid, s.remote_address, s.remote_port, s.success, s.time, p.cmdline, p.cmdline_size, p.parent, p.uid, p.euid FROM socket_events s JOIN process_events p ON p.pid = s.pid WHERE s.action='bind';",
      "removed": false,
      "interval": 60
    },
    "process_events": {
      "query": "SELECT auid, cmdline, ctime, cwd, egid, euid, gid, parent, path, pid, time, uid FROM process_events WHERE path NOT IN ('/bin/date', '/bin/mktemp', '/usr/bin/dirname', '/usr/bin/head', '/bin/uname', '/bin/basename');",
      "removed": false,
      "interval": 30
not actual config just pasted here
the scheduled queries on the event tables are the same across the 50 or so boxes running osqueryd, but there are 6 or 7 which this problem. all same OS.
p
no query for process_socket_events ?
I mean socket_events
oh, I see it
IF you have only one query per event table, you can set events_expiry=1 , which means it clears after access
I recommend getting sst_dump and running it on the files to see what's in the DB
s
just compiling now, thanks for the tips
p
np
s
some old events in there, and 38k lines like this:
data.auditeventpublisher.process_events.0003350252' seq:6482196, type:0 =>
I've added events_expiry=1 and it lasted a few hours, but is now stuck in that same loop. this particular box generates about 450 process_events in a 30 second window
p
If you start with a new empty database, how long does it take to get into this high cpu loop?
It seems like the rocksdb indexes are never deleted (but data is), and it means osquery is wasting time processing all the unnecessary indexes.
@shed7 I added a comment in https://github.com/facebook/osquery/issues/4333 . If you can try adding that LOG line, and let me know if it's happening, that would help. It's a tough one to reproduce, as devs who have looked into this issue see that all works fine for them.
u
@packetzero This is me who called spoock1024 in this issue https://github.com/facebook/osquery/pull/5335
this scenario only happens in my specific machines
have you reproduce this scenario in your machines? @packetzero
even as you said
Indexes are only being deleted if all queries have been completed. But I don't think the query results are posted yet
how can we avoid this?
@packetzero
p
I am unable to reproduce at the moment.
u
This is a huge problem that db is too large,it has an impact on osqueries.
p
it certainly is
This is why you should use a vendor's osquery. Vendors are the only ones motivated to fix issues.
BTW, do you see logging lines like this: I0210 230337.108081 69574656 rocksdb.cpp:68] RocksDB: [WARN] [db/column_family.cc:675] [queries] Stalling writes because we have 15 immutable memtables (waiting for flush), max_write_buffer_number is set to 16 rate 16777216
u
where do you see this logging lines?
p
It's an INFO log line, I am seeing it in the terminal
u
ok,let me check.I will tell you the result later
I just find another strange things that some of my machines have no process_events record but have only socket_event, when I run
curl
. it seems that the machine does not record the process_events. but the status of audit is normal.
I have no info like
RocksDB: [WARN] [db/column_family.cc:675] [queries] Stalling writes because we have 15 immutable memtables .....
s
me neither
u
have you solved your problem?
@shed7
s
Afraid not, I added the line packetzero gave above and recompiled, been using that version for about 12 hours, still the same problem exists but I don't see the DBG log line at all, and osqueryd has been restarted a few times
u
you can disable watchdog
s
doesn't help
p
I tried in vain to reproduce this on Monday, with no luck.
s
I'll start afresh on the boxes that are exhibiting this behaviour and keep and eye out. Thanks for your help