#ebpf
CptOfEvilMinions

03/24/2022, 8:40 PM
I have an Osquery agent producing the following errors with eBPF enabled:
2022-03-23 07:53:27 Worker returned exit status
2022-03-23 07:53:08 Error logging the results of query: pack/test-pack/BPF_PROC_EVENTS: IOError: Bad file descriptor
2022-03-23 07:53:08 Error adding new results to database for query pack/test-pack/BPF_PROC_EVENTS: IOError: Bad file descriptor
2022-03-23 07:52:53 RocksDB: [ERROR] [db/db_impl/db_impl_compaction_flush.cc:2541] Waiting after background compaction error: IO error: While appending to file: /var/osquery/osquery.db/052008.sst: Bad file descriptor, Accumulated background error counts: 1
2022-03-23 07:52:53 RocksDB: [WARN] [db/error_handler.cc:334] Background IO error IO error: While appending to file: /var/osquery/osquery.db/052008.sst: Bad file descriptor
2022-03-23 07:52:53 RocksDB: [WARN] [db/db_impl/db_impl_compaction_flush.cc:3019] Compaction error: IO error: While appending to file: /var/osquery/osquery.db/052008.sst: Bad file descriptor
eBPF flags
#### Process Auditing ####
--disable_events=false
--enable_bpf_events=true
--events_optimize=true
--events_expiry=3600
--events_max=200000
Pack queries:
SELECT * FROM bpf_socket_events WHERE local_port != 0;

SELECT * FROM bpf_process_events;
8:40 PM
@sharvil
alessandrogario

03/24/2022, 9:10 PM
This is bad, is this the latest version of osquery?
CptOfEvilMinions

03/24/2022, 9:19 PM
osqueryd --version
osqueryd version 5.1.0
5:17 PM
@alessandrogario any next steps you want me to take?
• Config dumps?
• Update to osquery v5.2.2?
• Other?
alessandrogario

03/28/2022, 5:29 PM
I am not really sure how to proceed, I don't think that bpf itself can break the database like that
5:29 PM
my guess is that BPF is tracing a lot of process activity and the watchdog is causing the worker to get killed
5:29 PM
and after repeated kills, the database just broke and stopped working
5:40 PM
If you are able to reproduce this with a new database, you could try to disable the watchdog
5:40 PM
if it doesn't happen again, then the worker being killed was the problem
5:41 PM
as an additional note, the BPF code does not access the database directly or use rocksdb in any way
CptOfEvilMinions

03/28/2022, 6:44 PM
So the watchdog memory limit is set to 8 GB and the CPU limit is the default
6:45 PM
Could the watchdog cpu limit be causing this?
alessandrogario

03/28/2022, 7:04 PM
That is my guess, yes
7:04 PM
How many processes are running on that machine?
CptOfEvilMinions

03/29/2022, 5:22 PM
How many processes on the overall system or osquery proc count?
5:23 PM
It’s a test machine with a lot of hardware resources, so overall system resources should not be the issue
5:24 PM
I will try disabling the watchdog and see if I can reproduce the error above.
alessandrogario

03/29/2022, 5:25 PM
How many processes are being created/closed and how many files are being opened/closed
5:25 PM
and/or, if there are containers running on the host and how many
5:26 PM
mostly to estimate how many events are hitting osquery
5:26 PM
also, as powerful as that machine can be, the event stream is processed in order and it's single threaded
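For a rough number on the process-creation side of that question, one option is to sample the cumulative `processes` counter in `/proc/stat` twice and divide by the interval. This is only a sketch: the helper names are made up for illustration, it assumes a Linux host, and it says nothing about file open/close or socket activity, which would need a separate measurement.

```python
import time


def fork_count(stat_text: str) -> int:
    """Return the cumulative fork counter from the text of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        # The line of interest looks like: "processes 123456"
        if len(fields) == 2 and fields[0] == "processes":
            return int(fields[1])
    raise ValueError("no 'processes' line found")


def forks_per_second(before: int, after: int, seconds: float) -> float:
    """Average fork rate between two samples taken `seconds` apart."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (after - before) / seconds


def sample(path: str = "/proc/stat", interval: float = 10.0) -> float:
    """Read the counter twice, `interval` seconds apart, and return the rate."""
    with open(path) as f:
        first = fork_count(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = fork_count(f.read())
    return forks_per_second(first, second, interval)


if __name__ == "__main__":
    print(f"{sample():.1f} forks/s")
```

Running it during a busy period and again during a quiet one gives a first estimate of how many process events the BPF publisher has to handle per second.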
npamnani

03/30/2022, 7:28 AM
From the logs, it looks like the problem is not confined to rocksdb: some code is randomly closing the fd, or there is a double close on an fd.
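To illustrate the failure mode described here (this is not osquery or RocksDB code, and the function names are invented for the example), the sketch below shows how using or closing a descriptor that has already been closed produces the same EBADF "Bad file descriptor" error seen in the logs.

```python
import errno
import os


def write_after_close() -> int:
    """Write to a descriptor that was already closed; return the errno."""
    read_fd, write_fd = os.pipe()
    os.close(write_fd)
    try:
        os.write(write_fd, b"x")  # the descriptor is no longer valid
    except OSError as exc:
        return exc.errno
    finally:
        os.close(read_fd)
    return 0


def double_close() -> int:
    """Close the same descriptor twice; return the errno of the second close."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)
    os.close(write_fd)
    try:
        os.close(write_fd)  # second close of the same descriptor number
    except OSError as exc:
        return exc.errno
    return 0


if __name__ == "__main__":
    print(errno.errorcode[write_after_close()])
    print(errno.errorcode[double_close()])
```

In a multi-threaded process the second close is more dangerous than it looks here: if another thread has opened a file in between, the descriptor number may have been reused, and the stray close silently invalidates that unrelated file, which then fails with EBADF on its next write.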
CptOfEvilMinions

04/06/2022, 5:09 PM
Sorry for the delayed response, it was a busy week. I have currently disabled the watchdog limits with -1.
5:11 PM
@alessandrogario as for your question about open files and procs, I am not sure. I can try and get some numbers. That being said, these are production systems so it’s expected to be high volume. Has Osquery + eBPF been tested on production systems?
alessandrogario

04/06/2022, 5:14 PM
I think bpf in its current form is hard to use on machines with high load
5:15 PM
in order to avoid dependencies, as per the osquery guidelines, bpf has to trace all the open/fork/exec/close/chdir/etc in order to work
5:16 PM
on those systems maybe audit could be a better idea for now, since the kernel provides a lot more metadata compared to what can be easily gathered with bpf