#core
zwass

05/11/2022, 11:09 PM
Does anyone have a sense of where most of the resource utilization in osquery comes from when using event-based tables with a high volume of events? Is it in the RocksDB read/write?
alessandrogario

05/11/2022, 11:38 PM
Is this about a specific platform? In Audit/BPF, there's a lot of stuff to parse, then there's a lot of state that needs updating, and finally some RocksDB too
11:39 PM
Looking to solve this with the experimental BPF work
zwass

05/11/2022, 11:39 PM
I'm just trying to get an intuitive sense of what's going on when people talk about resource consumption issues with evented tables.
alessandrogario

05/11/2022, 11:40 PM
for macOS, I haven't tried but I'm expecting Endpoint Security to work really well; events seem rather rich in metadata and don't require too much parsing/state handling
11:41 PM
On Linux we are at an additional disadvantage when the host is running many containers, essentially causing osquery to trace multiple machines worth of data at the same time
defensivedepth

05/12/2022, 11:55 AM
When I talk about performance issues with evented tables, it's usually two things:
• osquery itself is struggling with the volume of events - REF: https://osquery.slack.com/archives/C08V7KTJB/p1647970028338159
• my backend system is struggling with the volume of events that is aggregated from a bunch of osquery endpoints sending evented data. Other agents (like Sysinternals Sysmon) allow you to have complex filters at the endpoint, before the events are shipped off the box
I'm actually presenting at BSides Ft. Wayne next week on the 2nd issue - automatically generating osquery filters from Sysmon configs. (Something @fritz worked with me on last year)
Juan Alvarez

05/12/2022, 4:04 PM
We have many customers collecting `windows_events` to forward to our SIEM, and every time we hit issues with the watchdog limits as well as the RocksDB limitation mentioned in the thread above. I think those are fairly common scenarios in bigger companies, where the rate on a single box goes up to at least 500 EPS. I am not sure I can say where the memory consumption comes from at a lower level, but we usually send the data using the `tls` logger, and I think performance seems better when using `filesystem` (which we can combine with other software for remote sending, like fluentbit). I have now been testing a logger we developed, with ingestion levels near 1000 EPS approx., that seems stable, but it only works in memory for now (no RocksDB). Just sharing this info in case it helps.
seph

05/13/2022, 7:13 PM
In the distant past, Uptycs was talking about how performance in that kind of log shipping situation suffered because the data flows through the sql engine. Honestly, it always felt a bit weird? Like if you don’t want any of the SQL side, why use osquery as your high volume log shipper?
alessandrogario

05/13/2022, 7:13 PM
we had a cool concept from packetzero that traded data integrity for performance
7:14 PM
I think we could have a different database plugin, a rocksdb alternative
zwass

05/13/2022, 7:14 PM
Ah yeah I remember both of those conversations. Would it be crazy to try to use SQLite?
alessandrogario

05/13/2022, 7:15 PM
I think there are different things in the discussion:
1. Going through sqlite to compute the results
2. Where results are stored (currently rocksdb)
3. Where buffered log lines are stored (also rocksdb, from the buffered log forwarder)
7:16 PM
packetzero's PoC replaced 2 and 3, but still went through 1
7:16 PM
uptycs wanted to bypass 1/2/3 entirely
seph

05/13/2022, 7:18 PM
I’m generally against bypassing (1). Mostly coming from a “what else is osquery” stance. I’m willing to be convinced. And I have no deep feelings about (2) and (3). To me, those are implementation details, and I’ll roll with whatever y’all tell me makes sense
alessandrogario

05/13/2022, 7:18 PM
Totally agree with the above sentence
7:18 PM
We also have to talk about integrity again, because it's not as guaranteed as it could be in RocksDB
7:19 PM
and one might argue that it should either be all off and fast, or all on and slow (i.e. not the current situation where it's mixed)
zwass

05/13/2022, 7:50 PM
I'm most curious about whether it could make sense to use sqlite to do 2 and 3. But it doesn't matter if RocksDB is only a small portion of the resource utilization for the event based tables.
alessandrogario

05/13/2022, 7:55 PM
before the integrity was relaxed again, it was taking a significant toll on cpu/memory
7:55 PM
I seem to recall that packetzero's PoC had clear advantages on that front
7:57 PM
I think we had an sqlite database plugin, but it was deprecated (while still experimental). It was frowned upon, but I never used it so I don't know how it performed or why it was removed
seph

05/13/2022, 7:59 PM
Does our SQLite have any kind of disk persistence mechanism?
zwass

05/13/2022, 8:35 PM
IIRC we open the DB in memory, so even when you do things like `create table` they go away on restart.
seph

05/13/2022, 8:36 PM
Kinda what I mean… So using sqlite for (2) or (3) has data loss questions
zwass

05/13/2022, 8:39 PM
Ah yeah but I would think if we did it we would use disk-backed sqlite.
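The difference between the in-memory database and a disk-backed one can be illustrated with a small sketch using Python's standard `sqlite3` module. This is not osquery code: the `results` table, the file name, and the `table_survives_reopen` helper are made up for illustration, and closing and reopening the connection stands in for a process restart.

```python
import os
import sqlite3
import tempfile


def table_survives_reopen(path: str) -> bool:
    """Create a table, close the connection, reopen, and check for the table."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE results (id INTEGER PRIMARY KEY, line TEXT)")
    conn.execute("INSERT INTO results (line) VALUES ('buffered log line')")
    conn.commit()
    conn.close()  # stands in for a process restart

    conn = sqlite3.connect(path)
    row = conn.execute(
        "SELECT count(*) FROM sqlite_master "
        "WHERE type = 'table' AND name = 'results'"
    ).fetchone()
    conn.close()
    return row[0] == 1


# ":memory:" gives every connection its own private, throwaway database.
in_memory = table_survives_reopen(":memory:")

# A file path makes SQLite persist the schema and rows to disk.
with tempfile.TemporaryDirectory() as tmp:
    on_disk = table_survives_reopen(os.path.join(tmp, "sketch.db"))

print("in-memory table survives:", in_memory)
print("disk-backed table survives:", on_disk)
```

Only the disk-backed variant keeps the table across the reopen, which is why using sqlite for (2) or (3) would need a file-backed database.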
seph

05/13/2022, 8:39 PM
Could be an interesting experiment. Though I wonder why. It feels like a deepish change, and hitting disk is hitting disk.
zwass

05/13/2022, 8:41 PM
Partly I'm interested in the idea of "sqlite all the way down". Getting some of the data living in an actual sqlite table seems like it could bring osquery's behavior closer to what many folks expect.
seph

05/13/2022, 8:41 PM
Maaaybe. I am somewhat skeptical.
zwass

05/13/2022, 8:41 PM
Also I'm still scarred from seeing so many DB corruption issues in the past with RocksDB. I'm not sure that's been an issue as much lately though?
seph

05/13/2022, 8:42 PM
For (2) and (3) I think we’d end up needing to do a lot that subverts sqlite expectations.
8:44 PM
My gut sense is that when people hint around “what a db should be”, what they really mean is something about performance on tables. E.g.: tables should be data on disk, so that join performance is as expected. I’m sympathetic, but I’m not always sure I agree. Mostly I think of osquery as an api translation layer. With that caveat, I think exploring osquery as more of a db would be interesting. But I don’t think I’d come at it from (2) and (3) above. Those feel weirdly tied into events. I’d probably start with:
• The existing query cache stuff that never seems to work?
• stefano’s caching code
• Moving away from eponymous tables to ???
But I think there are a lot of hard-to-answer questions.
8:45 PM
Something like the `file` table, or the `plist` one, cannot be real tables. Those are close to functions masquerading as tables.
zwass

05/13/2022, 10:07 PM
Many (most?) tables aren't conducive to that pattern because there's no API for events. Osquery as an api translation layer works really well for a lot of apis, but less so for events IMO.
seph

05/13/2022, 10:14 PM
I agree! I don't have a simple model for events.
10:14 PM
It's like the api translation bolted onto a table store, with a magic cleanup routine. Which is a mouthful.
alessandrogario

05/13/2022, 11:49 PM
joins are also really weird when using evented tables
11:51 PM
i wouldn't mind having an alternative interface to event data, the problem is identifying what it should look like
11:52 PM
given how JOINs essentially add race conditions to evented tables, one way of thinking would be to just never do it and make sure evented rows have everything you could possibly need
11:53 PM
at that point, though, it doesn't make much sense to use sqlite to access it
11:53 PM
(race conditions as in: joining the user id of an audit event with the users table)
seph

05/14/2022, 12:02 AM
Thinking aloud….
1. Events feed a table
2. That table has some max events cleanup
3. Events have a unique Id
4. join select, whatever.
5. The magic thing would track last id, and add an implicit “where id > last”
But that's a lot of overhead, and I'm not sure it would be performant.
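The five steps above can be sketched roughly as follows, using Python's standard `sqlite3` module. Everything here is hypothetical and not how osquery stores events: the `events` schema, the `MAX_EVENTS` cap, and the `publish` / `query_new_events` helpers are invented to show the "track last id" idea.

```python
import sqlite3

MAX_EVENTS = 5  # hypothetical retention cap (step 2)

conn = sqlite3.connect(":memory:")
# AUTOINCREMENT gives each event a unique, never-reused id (step 3).
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY AUTOINCREMENT, path TEXT)"
)
last_seen_id = 0  # the "magic" cursor (step 5)


def publish(path: str) -> None:
    """Step 1: an event feeds the table; step 2: trim to the newest rows."""
    conn.execute("INSERT INTO events (path) VALUES (?)", (path,))
    conn.execute(
        "DELETE FROM events WHERE id <= (SELECT max(id) FROM events) - ?",
        (MAX_EVENTS,),
    )


def query_new_events() -> list:
    """Steps 4 and 5: run a query with an implicit 'WHERE id > last'."""
    global last_seen_id
    rows = conn.execute(
        "SELECT id, path FROM events WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    if rows:
        last_seen_id = rows[-1][0]
    return rows


for name in ("/bin/ls", "/bin/cat", "/usr/bin/ssh"):
    publish(name)
first = query_new_events()   # the three events published so far
publish("/bin/sh")
second = query_new_events()  # only the event added since the last query
```

In this sketch, a reader that falls more than `MAX_EVENTS` rows behind silently misses the trimmed events, which is one of the trade-offs such a design would have to settle.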
alessandrogario

05/14/2022, 12:05 AM
I think the problem is that you can for example add/remove users while events are generated
12:05 AM
if the event itself is not capturing the username then it's only a guess what is going to happen when joining against the users table
12:06 AM
worst case scenario user IDs are being reused and we would end up with the same ID mapping to different names
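A minimal sketch of that worst case, using Python's standard `sqlite3` module with made-up, simplified `users` and `process_events` tables (the real osquery schemas have many more columns, and the names `alice` and `bob` are purely illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (uid INTEGER PRIMARY KEY, username TEXT)")
conn.execute("CREATE TABLE process_events (uid INTEGER, path TEXT)")

# alice (uid 1001) runs a command; the event records only the uid.
conn.execute("INSERT INTO users VALUES (1001, 'alice')")
conn.execute("INSERT INTO process_events VALUES (1001, '/usr/bin/scp')")

# Before the scheduled query runs, alice is removed and the uid is
# reassigned to a new account.
conn.execute("DELETE FROM users WHERE uid = 1001")
conn.execute("INSERT INTO users VALUES (1001, 'bob')")

# The join resolves the uid against the *current* users table.
joined = conn.execute(
    "SELECT e.path, u.username "
    "FROM process_events e JOIN users u USING (uid)"
).fetchall()
print(joined)
```

The join attributes alice's event to bob, because the username is looked up at query time rather than captured when the event happened.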
12:08 AM
if we talk about data "quality", events are high because they are coming from the source and hopefully things have been acquired atomically (like endpoint sec, or the metadata we get in bpf/audit)
12:08 AM
joining against anything will lower the quality significantly, and I wondered many times if it actually makes sense to do it
12:09 AM
i can see why uptycs would like to acquire data as is, sending it directly to the logger