# core
z
Does anyone have a sense of where most of the resource utilization in osquery comes from when using event-based tables with a high volume of events? Is it in the RocksDB read/write?
a
Is this about a specific platform? In Audit/BPF, there's a lot of stuff to parse, then there's a lot of state that needs updating, and finally some RocksDB too
Looking to solve this with the experimental BPF work
z
I'm just trying to get an intuitive sense of what's going on when people talk about resource consumption issues with evented tables.
a
for macOS, I haven't tried it but I'm expecting Endpoint Security to work really well; events seem rather rich in metadata and shouldn't require too much parsing/state handling
On Linux we are at an additional disadvantage when the host is running many containers, essentially causing osquery to trace multiple machines' worth of data at the same time
ty 1
d
When I talk about performance issues with evented tables, it's usually two things:
• osquery itself is struggling with the volume of events - REF: https://osquery.slack.com/archives/C08V7KTJB/p1647970028338159
• my backend system is struggling with the volume of events aggregated from a bunch of osquery endpoints sending evented data. Other agents (like Sysinternals Sysmon) allow you to have complex filters at the endpoint, before the events are shipped off the box
I'm actually presenting at BSides Fort Wayne next week on the 2nd issue - automatically generating osquery filters from sysmon configs. (Something @fritz worked with me on last year)
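As a rough illustration of the filtering idea (the paths and event ID choices here are placeholders, not a recommended filter), a schedule like this keeps the noisiest process events from ever leaving the box:
```sql
-- Hypothetical endpoint-side filter: only ship process events whose binary
-- lives outside a couple of known-noisy directories (paths are placeholders).
SELECT pid, path, cmdline, uid, time
FROM process_events
WHERE path NOT LIKE '/usr/bin/%'
  AND path NOT LIKE '/usr/lib/%';
```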
ty 1
j
We have many customers collecting `windows_events` to forward to our SIEM, and every time we hit issues with the watchdog limits, as well as the rocksdb limitation mentioned in the thread above. I think those are fairly common scenarios in bigger companies, where a single box goes up to at least 500 EPS. I am not sure I can say where the memory consumption comes from at a lower level, but we used to send the data using the `tls` logger, and the performance seems better when using `filesystem` (which we can combine with some other software for remote sending, like fluentbit). Now I have been testing with a logger developed by us, with ingestion levels near 1000 EPS approx, that seems stable, but it just works in memory for now (no rocksdb). Just sharing this info in case it helps.
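Roughly what one of those schedules looks like (just a sketch; the Security channel and the eventid list are examples, and column names may vary a bit between osquery versions):
```sql
-- Sketch: trim EPS at the endpoint by only selecting the event IDs the SIEM
-- actually needs (4624/4625 logons and 4688 process creation as examples).
SELECT time, source, provider_name, eventid, data
FROM windows_events
WHERE source = 'Security'
  AND eventid IN (4624, 4625, 4688);
```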
ty 1
s
In the distant past, Uptycs was talking about how performance in that kind of log shipping situation suffered because the data flows through the SQL engine. Honestly, it always felt a bit weird? Like if you don't want any of the SQL side, why use osquery as your high volume log shipper?
a
we had a cool concept from packetzero that traded data integrity for performance
I think we could have a different database plugin, a rocksdb alternative
z
Ah yeah I remember both of those conversations. Would it be crazy to try to use SQLite?
a
I think there are different things in the discussion:
1. Going through sqlite to compute the results
2. Where results are stored (currently rocksdb)
3. Where buffered log lines are stored (also rocksdb, from the buffered log forwarder)
packetzero's PoC replaced 2 and 3, but still went through 1
uptycs wanted to bypass 1/2/3 entirely
s
I’m generally against bypassing (1). Mostly coming from a “what else is osquery” stance. I’m willing to be convinced. And I have no deep feelings about (2) and (3). To me, those are implementation details, and I’ll roll with whatever y’all tell me makes sense
💯 1
a
Totally agree with the above sentence
We also have to talk about integrity again, because it's not as guaranteed as it could be in RocksDB
and one might argue that it should either be all off and fast, or all on and slow (i.e. not the current situation where it's mixed)
z
I'm most curious about whether it could make sense to use sqlite to do 2 and 3. But it doesn't matter if RocksDB is only a small portion of the resource utilization for the event-based tables.
a
before the integrity was relaxed again, it was taking a significant toll on CPU/memory
I seem to recall that packetzero's PoC had clear advantages on that front
I think we had an sqlite database plugin, but it was deprecated (while still experimental). It was frowned upon, but I never used it, so I don't know how it performed or why it was removed
s
Does our SQLite have any kind of disk persistence mechanism?
z
IIRC we open the DB in memory, so even when you do things like `create table`, they go away on restart.
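e.g. (just illustrating the point in osqueryi):
```sql
-- Works for the current session, since the sqlite instance is in-memory...
CREATE TABLE scratch (pid INTEGER, name TEXT);
INSERT INTO scratch SELECT pid, name FROM processes LIMIT 5;
SELECT * FROM scratch;
-- ...restart osqueryi and 'scratch' is gone.
```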
s
Kinda what I mean… So using sqlite for (2) or (3) has data loss questions
z
Ah yeah but I would think if we did it we would use disk-backed sqlite.
s
Could be an interesting experiment. Though I wonder why. It feels like a deepish change, and hitting disk is hitting disk.
z
Partly I'm interested in the idea of "sqlite all the way down". Getting some of the data living in an actual sqlite table seems like it could bring osquery's behavior closer to what many folks expect.
s
Maaaybe. I am somewhat skeptical.
z
Also I'm still scarred from seeing so many DB corruption issues in the past with RocksDB. I'm not sure that's been an issue as much lately though?
s
For (2) and (3) I think we’d end up needing to do a lot that subverts sqlite expectations.
My gut sense is that when people hint around "what a db should be", what they really mean is something about performance on tables. E.g.: tables should be data on disk, so that join performance is as expected. I'm sympathetic, but I'm not always sure I agree. Mostly I think of osquery as an api translation layer. With that caveat, I think exploring osquery as more of a db would be interesting. But I don't think I'd come at it from (2) and (3) above. Those feel weirdly tied into events. I'd probably start with:
• The existing query cache stuff that never seems to work?
• stefano's caching code
• Moving away from eponymous tables to ???
But I think there are a lot of hard-to-answer questions.
Something like the `file` table, or the `plist` one, cannot be real tables. Those are close to functions masquerading as tables.
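e.g. `file` only really makes sense with a constraint acting as its "argument" (illustrative; without the WHERE there's nothing for it to enumerate):
```sql
-- 'file' reads like a function call: the WHERE clause is effectively its input.
SELECT path, size, mode
FROM file
WHERE path = '/etc/hosts';
```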
z
Many (most?) tables aren't conducive to that pattern because there's no API for events. Osquery as an api translation layer works really well for a lot of apis, but less so for events IMO.
s
I agree! I don't have a simple model for events.
It's like the api translation bolted onto a table store, with a magic cleanup routine. Which is a mouthful.
💯 1
a
joins are also really weird when using evented tables
i wouldn't mind having an alternative interface to event data, the problem is identifying what it should look like
given how JOINs essentially add race conditions to evented tables, one way of thinking would be to just never do it and make sure evented rows have everything you could possibly need
at that point, though, it doesn't make much sense to use sqlite to access it
(race conditions as in: joining the user id of an audit event with the users table)
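concretely, something like this (illustrative, with process_events standing in for any audit-backed table): the uid recorded in the event is joined against whatever the users table says now, not what it said when the event fired
```sql
-- Illustration of the race: the uid captured at event time is resolved against
-- the current users table, which may have changed (or reused the uid) since.
SELECT pe.time, pe.path, pe.uid, u.username
FROM process_events AS pe
LEFT JOIN users AS u USING (uid);
```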
s
Thinking aloud…
1. Events feed a table
2. That table has some max-events cleanup
3. Events have a unique id
4. join, select, whatever
5. The magic thing would track the last id, and add an implicit "where id > last" (rough sketch below)
But that's a lot of overhead, and I'm not sure it would be performant.
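Very roughly (a pure sketch; `some_events`, `other_table`, and the join key are placeholders, with `eid` standing in for the unique id):
```sql
-- What the user writes:
SELECT * FROM some_events JOIN other_table USING (key);

-- What the magic layer would actually run, remembering last_eid per query:
SELECT * FROM some_events JOIN other_table USING (key)
WHERE some_events.eid > :last_eid;
```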
a
I think the problem is that you can, for example, add/remove users while events are being generated
if the event itself is not capturing the username, then it's only a guess what is going to happen when joining against the users table
worst case scenario, user IDs get reused and we end up with the same ID mapping to different names
if we talk about data "quality", events rank high because they come straight from the source and hopefully things have been acquired atomically (like Endpoint Security, or the metadata we get in bpf/audit)
joining against anything will lower the quality significantly, and I wondered many times if it actually makes sense to do it
I can see why Uptycs would like to acquire data as-is, sending it directly to the logger