< shed7> I think we have had some machines with this issue N osquery #general


packetzero

02/08/2019, 2:50 PM

@shed7 I think we have had some machines with this issue. Nice work on diagnosing the workaround. This reinforces my view that we need to have maximum sizes on the rocksdb indexes and data.

shed7

02/08/2019, 2:56 PM

Cheers, I spoke too soon though. One of the machines is back up to 100% on one core, but the worker is getting killed every 16 seconds or so.

packetzero

02/08/2019, 2:57 PM

have you run sst_dump on the .sst files to see which tables / indexes are not being cleaned up?

shed7

02/08/2019, 2:58 PM

have not heard of sst_dump before

packetzero

02/08/2019, 2:58 PM

it's not part of the osquery stuff, it's a tool in the rocksdb distro / build

shed7

02/08/2019, 2:59 PM

what tables & indexes should be cleaned up when osqueryd stops ?

packetzero

02/08/2019, 2:59 PM

I'm thinking that you have an events table being populated, but maybe not queries in your schedule.

packetzero

02/08/2019, 2:59 PM

so it never gets removed

packetzero

02/08/2019, 3:00 PM

(the data never gets deleted)

packetzero

02/08/2019, 3:00 PM

which should happen every 5 minutes or so, and after restart

shed7

02/08/2019, 3:01 PM

atm I have these flags:


--audit_force_reconfigure=true
--audit_debug
--logger_min_status=1
--disable_events=false
--logger_plugin=syslog
--host_identifier=hostname
--schedule_splay_percent=10
--disable_audit=false
--audit_allow_config=true
--audit_persist=true
--audit_allow_process_events=true
--events_expiry=120
--events_max=50000
--audit_allow_sockets=true
--verbose
--watchdog_delay=120
--disable_extensions=true

shed7

02/08/2019, 3:02 PM

and scheduled queries:

"file_events": {
      "query": "SELECT * from file_events;",
      "removed": false,
      "interval": 30
    },
    "socket_events": {
      "query": "SELECT s.action, s.auid, s.family, s.local_address, s.local_port, s.path, s.pid, s.remote_address, s.remote_port, s.success, s.time, p.cmdline, p.cmdline_size, p.parent, p.uid, p.euid FROM socket_events s JOIN process_events p ON p.pid = s.pid WHERE s.action='bind';",
      "removed": false,
      "interval": 60
    },
    "process_events": {
      "query": "SELECT auid, cmdline, ctime, cwd, egid, euid, gid, parent, path, pid, time, uid FROM process_events WHERE path NOT IN ('/bin/date', '/bin/mktemp', '/usr/bin/dirname', '/usr/bin/head', '/bin/uname', '/bin/basename');",
      "removed": false,
      "interval": 30
    }

shed7

02/08/2019, 3:03 PM

not actual config just pasted here

shed7

02/08/2019, 3:04 PM

The scheduled queries on the event tables are the same across the 50 or so boxes running osqueryd, but there are 6 or 7 which have this problem. All are on the same OS.

packetzero

02/08/2019, 3:09 PM

no query for process_socket_events ?

packetzero

02/08/2019, 3:09 PM

I mean socket_events

packetzero

02/08/2019, 3:09 PM

oh, I see it

packetzero

02/08/2019, 3:12 PM

If you have only one query per event table, you can set events_expiry=1, which means the events are cleared after they are accessed.
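The "only one query per event table" condition can be checked mechanically. The following is a minimal sketch, not an osquery tool: it assumes the config is a JSON file with a top-level `schedule` object, and it identifies event tables purely by a name ending in `_events` (so a column with such a name would be a false positive). The function names `event_table_readers` and `shared_event_tables` are hypothetical. Whether a table that is only joined against counts as a second reader for expiry purposes is also an assumption here.

```python
import json
import re
from collections import defaultdict

# Assumption: event tables are identified purely by a name ending in "_events".
EVENT_TABLE_RE = re.compile(r"\b(\w+_events)\b")


def event_table_readers(schedule):
    """Map each *_events table name to the scheduled queries that mention it."""
    readers = defaultdict(list)
    for name, entry in schedule.items():
        for table in sorted(set(EVENT_TABLE_RE.findall(entry["query"]))):
            readers[table].append(name)
    return dict(readers)


def shared_event_tables(schedule):
    """Return only the event tables mentioned by more than one scheduled query."""
    return {t: q for t, q in event_table_readers(schedule).items() if len(q) > 1}


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python3 check_schedule.py osquery.conf
    with open(sys.argv[1]) as fh:
        config = json.load(fh)
    for table, queries in shared_event_tables(config["schedule"]).items():
        print(f"{table}: read by {', '.join(queries)}")
```

Run against the schedule pasted earlier in this thread, a check like this would flag process_events, because the socket_events query joins against it in addition to the dedicated process_events query.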

packetzero

02/08/2019, 3:13 PM

I recommend getting sst_dump and running it on the files to see what's in the DB

packetzero

02/08/2019, 3:14 PM

see this one: https://github.com/facebook/osquery/issues/4333

shed7

02/08/2019, 3:14 PM

just compiling now, thanks for the tips

shed7

02/08/2019, 4:43 PM

some old events in there, and 38k lines like this:

'data.auditeventpublisher.process_events.0003350252' seq:6482196, type:0 =>
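To see which tables account for most of the records, a dump like this can be tallied by key prefix. This is a minimal sketch, assuming the text comes from `sst_dump --file=<file>.sst --command=scan` and that each record line has the shape `'<key>' seq:<n>, type:<n> => <value>` as in the line above. The helper name `tally_prefixes` and the choice to drop the trailing numeric ID are illustrative assumptions, not part of osquery or RocksDB.

```python
import re
import sys
from collections import Counter

# Matches the key at the start of a record line, for example:
#   'data.auditeventpublisher.process_events.0003350252' seq:6482196, type:0 =>
# The opening quote is optional because pasted output may have lost it.
KEY_RE = re.compile(r"^'?(?P<key>[^']+)' seq:\d+, type:\d+ =>")


def tally_prefixes(lines):
    """Count records per key prefix, dropping a trailing numeric ID component."""
    counts = Counter()
    for line in lines:
        match = KEY_RE.match(line.strip())
        if not match:
            continue  # skip headers and anything that is not a record line
        parts = match.group("key").split(".")
        if parts[-1].isdigit():
            parts = parts[:-1]
        counts[".".join(parts)] += 1
    return counts


if __name__ == "__main__":
    # Reads dump text on stdin and prints the largest prefixes first.
    for prefix, n in tally_prefixes(sys.stdin).most_common():
        print(f"{n:8d}  {prefix}")
```

Usage would be along the lines of `sst_dump --file=<file>.sst --command=scan | python3 tally.py`, where `tally.py` is a hypothetical name for the script above.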

shed7

02/09/2019, 10:12 AM

I've added events_expiry=1 and it lasted a few hours, but it is now stuck in that same loop. This particular box generates about 450 process_events in a 30-second window.
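For a sense of scale, the figures quoted in this thread can be turned into a rough rate estimate. This is a back-of-the-envelope sketch only: it assumes a steady rate, and it assumes that the events_max=50000 flag shown earlier counts individual events, which may not match how a given osquery version applies it. The function names are hypothetical.

```python
def events_per_second(count, window_seconds):
    """Average event rate over an observation window."""
    return count / window_seconds


def events_within(seconds, rate):
    """Events generated during `seconds` at a steady `rate`."""
    return seconds * rate


def seconds_to_reach(limit, rate):
    """Time for `limit` events to accumulate at `rate` if none are removed."""
    return limit / rate


if __name__ == "__main__":
    rate = events_per_second(450, 30)  # figures reported in this thread
    print(f"rate: {rate:.1f} events/s")
    print(f"events in a 120 s window: {events_within(120, rate):.0f}")
    print(f"minutes to accumulate 50000: {seconds_to_reach(50000, rate) / 60:.1f}")
```

Under those assumptions the box produces about 15 events per second, roughly 1800 events per 120 seconds, and would accumulate 50000 events in a little under an hour if nothing were ever removed.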

packetzero

02/09/2019, 7:43 PM

If you start with a new empty database, how long does it take to get into this high cpu loop?

packetzero

02/09/2019, 7:44 PM

It seems like the rocksdb indexes are never deleted (but data is), and it means osquery is wasting time processing all the unnecessary indexes.

packetzero

02/10/2019, 6:14 PM

@shed7 I added a comment in https://github.com/facebook/osquery/issues/4333 . If you can try adding that LOG line, and let me know if it's happening, that would help. It's a tough one to reproduce, as devs who have looked into this issue see that all works fine for them.

钢铁侠

02/11/2019, 3:42 AM

@packetzero I am the user called spoock1024 in this issue: https://github.com/facebook/osquery/pull/5335

钢铁侠

02/11/2019, 3:42 AM

This scenario only happens on some specific machines of mine.

钢铁侠

02/11/2019, 4:02 AM

Have you reproduced this scenario on your machines? @packetzero

钢铁侠

02/11/2019, 4:24 AM

Even if it is as you said:

"Indexes are only being deleted if all queries have been completed. But I don't think the query results are posted yet"

How can we avoid this?

钢铁侠

02/11/2019, 4:24 AM

@packetzero

packetzero

02/11/2019, 4:54 AM

I am unable to reproduce at the moment.

钢铁侠

02/11/2019, 5:00 AM

This is a huge problem: the db is too large, and it has an impact on osquery.

packetzero

02/11/2019, 5:15 AM

it certainly is

packetzero

02/11/2019, 5:16 AM

This is why you should use a vendor's osquery. Vendors are the only ones motivated to fix issues.

packetzero

02/11/2019, 5:17 AM

BTW, do you see logging lines like this: I0210 23:03:37.108081 69574656 rocksdb.cpp:68] RocksDB: [WARN] [db/column_family.cc:675] [queries] Stalling writes because we have 15 immutable memtables (waiting for flush), max_write_buffer_number is set to 16 rate 16777216

钢铁侠

02/11/2019, 5:22 AM

Where do you see these log lines?

packetzero

02/11/2019, 5:46 AM

It's an INFO log line, I am seeing it in the terminal

钢铁侠

02/11/2019, 5:56 AM

OK, let me check. I will tell you the result later.

钢铁侠

02/11/2019, 6:56 AM

I just found another strange thing: when I run curl, some of my machines have no process_events records, only socket_events. It seems that the machine does not record process_events, but the status of audit is normal.

钢铁侠

02/11/2019, 8:00 AM

I have no info like

RocksDB: [WARN] [db/column_family.cc:675] [queries] Stalling writes because we have 15 immutable memtables .....

shed7

02/11/2019, 2:36 PM

me neither

钢铁侠

02/12/2019, 5:08 AM

have you solved your problem?

钢铁侠

02/12/2019, 5:08 AM

@shed7

shed7

02/12/2019, 8:42 AM

Afraid not. I added the line packetzero gave above and recompiled, and I have been using that version for about 12 hours. The same problem still exists, but I don't see the DBG log line at all, and osqueryd has been restarted a few times.

钢铁侠

02/13/2019, 8:13 AM

you can disable watchdog

shed7

02/13/2019, 8:32 AM

doesn't help

packetzero

02/13/2019, 2:37 PM

I tried in vain to reproduce this on Monday, with no luck.

shed7

02/13/2019, 3:24 PM

I'll start afresh on the boxes that are exhibiting this behaviour and keep an eye out. Thanks for your help.
