# core
z
Does anyone have a sense of where most of the resource utilization in osquery comes from when using event-based tables with a high volume of events? Is it in the RocksDB read/write?
a
Is this about a specific platform? In Audit/BPF, there's a lot of stuff to parse, then there's a lot of state that needs updating, and finally some RocksDB too
Looking to solve this with the experimental BPF work
z
I'm just trying to get an intuitive sense of what's going on when people talk about resource consumption issues with evented tables.
a
for macOS, I haven't tried it but I'm expecting Endpoint Security to work really well; events seem rather rich in metadata and shouldn't require too much parsing/state handling
On Linux we are at an additional disadvantage when the host is running many containers, essentially causing osquery to trace multiple machines' worth of data at the same time
ty 1
d
When I talk about performance issues with evented tables, it's usually two things:
• osquery itself is struggling with the volume of events - REF: https://osquery.slack.com/archives/C08V7KTJB/p1647970028338159
• my backend system is struggling with the volume of events aggregated from a bunch of osquery endpoints sending evented data. Other agents (like Sysinternals Sysmon) allow you to have complex filters at the endpoint, before the events are shipped off the box
I'm actually presenting at BSides Fort Wayne next week on the 2nd issue - automatically generating osquery filters from sysmon configs. (Something @fritz worked with me on last year)
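As a rough illustration of the filtering idea (the paths and event ID choices here are placeholders, not a recommended filter), a schedule like this keeps the noisiest process events from ever leaving the box:
```sql
-- Hypothetical endpoint-side filter: only ship process events whose binary
-- lives outside a couple of known-noisy directories (paths are placeholders).
SELECT pid, path, cmdline, uid, time
FROM process_events
WHERE path NOT LIKE '/usr/bin/%'
  AND path NOT LIKE '/usr/lib/%';
```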
ty 1
j
We have many customers collecting `windows_events` to forward to our SIEM, and every time we hit issues with the watchdog limits, as well as the rocksdb limitation mentioned in the thread above. I think those are fairly common scenarios in bigger companies, where a single box goes up to at least 500 EPS. I am not sure I can say where the memory consumption comes from at a lower level, but we used to send the data using the `tls` logger, and the performance seems better when using `filesystem` (which we can combine with some other software for remote sending, like fluentbit). Now I have been testing with a logger developed by us, with ingestion levels near 1000 EPS approx, that seems stable, but it just works in memory for now (no rocksdb). Just sharing this info in case it helps.
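Roughly what one of those schedules looks like (just a sketch; the Security channel and the eventid list are examples, and column names may vary a bit between osquery versions):
```sql
-- Sketch: trim EPS at the endpoint by only selecting the event IDs the SIEM
-- actually needs (4624/4625 logons and 4688 process creation as examples).
SELECT time, source, provider_name, eventid, data
FROM windows_events
WHERE source = 'Security'
  AND eventid IN (4624, 4625, 4688);
```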
ty 1
s
In the distant past, Uptycs was talking about how performance in that kind of log shipping situation suffered because the data flows through the SQL engine. Honestly, it always felt a bit weird? Like if you don't want any of the SQL side, why use osquery as your high volume log shipper?
a
we had a cool concept from packetzero that traded data integrity for performance
I think we could have a different database plugin, a rocksdb alternative
z
Ah yeah I remember both of those conversations. Would it be crazy to try to use SQLite?
a
I think there are different things in the discussion:
1. Going through sqlite to compute the results
2. Where results are stored (currently rocksdb)
3. Where buffered log lines are stored (also rocksdb, from the buffered log forwarder)
packetzero's PoC replaced 2 and 3, but still went through 1
uptycs wanted to bypass 1/2/3 entirely
s
I’m generally against bypassing (1). Mostly coming from a “what else is osquery” stance. I’m willing to be convinced. And I have no deep feelings about (2) and (3). To me, those are implementation details, and I’ll roll with whatever y’all tell me makes sense
💯 1
a
Totally agree with the above sentence
We also have to talk about integrity again, because it's not as guaranteed as it could be in RocksDB
and one might argue that it should either be all off and fast, or all on and slow (i.e. not the current situation where it's mixed)
z
I'm most curious about whether it could make sense to use sqlite to do 2 and 3. But it doesn't matter if RocksDB is only a small portion of the resource utilization for the event-based tables.
a
before the integrity was relaxed again, it was taking a significant toll on CPU/memory
I seem to recall that packetzero's PoC had clear advantages on that front
I think we had an sqlite database plugin, but it was deprecated (while still experimental). It was frowned upon, but I never used it, so I don't know how it performed or why it was removed
s
Does our SQLite have any kind of disk persistence mechanism?
z
IIRC we open the DB in memory, so even when you do things like `create table`, they go away on restart.
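e.g. (just illustrating the point in osqueryi):
```sql
-- Works for the current session, since the sqlite instance is in-memory...
CREATE TABLE scratch (pid INTEGER, name TEXT);
INSERT INTO scratch SELECT pid, name FROM processes LIMIT 5;
SELECT * FROM scratch;
-- ...restart osqueryi and 'scratch' is gone.
```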
s
Kinda what I mean… So using sqlite for (2) or (3) has data loss questions
z
Ah yeah but I would think if we did it we would use disk-backed sqlite.
s
Could be an interesting experiment. Though I wonder why. It feels like a deepish change, and hitting disk is hitting disk.
z
Partly I'm interested in the idea of "sqlite all the way down". Getting some of the data living in an actual sqlite table seems like it could bring osquery's behavior closer to what many folks expect.
s
Maaaybe. I am somewhat skeptical.
z
Also I'm still scarred from seeing so many DB corruption issues in the past with RocksDB. I'm not sure that's been an issue as much lately though?
s
For (2) and (3) I think we’d end up needing to do a lot that subverts sqlite expectations.
My gut sense is that when people hint around "what a db should be", what they really mean is something about performance on tables. E.g.: tables should be data on disk, so that join performance is as expected. I'm sympathetic, but I'm not always sure I agree. Mostly I think of osquery as an api translation layer. With that caveat, I think exploring osquery as more of a db would be interesting. But I don't think I'd come at it from (2) and (3) above. Those feel weirdly tied into events. I'd probably start with:
• The existing query cache stuff that never seems to work?
• stefano's caching code
• Moving away from eponymous tables to ???
But I think there are a lot of hard-to-answer questions.
Something like the `file` table, or the `plist` one, cannot be real tables. Those are close to functions masquerading as tables.
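e.g. `file` only really makes sense with a constraint acting as its "argument" (illustrative; without the WHERE there's nothing for it to enumerate):
```sql
-- 'file' reads like a function call: the WHERE clause is effectively its input.
SELECT path, size, mode
FROM file
WHERE path = '/etc/hosts';
```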
z
Many (most?) tables aren't conducive to that pattern because there's no API for events. Osquery as an api translation layer works really well for a lot of apis, but less so for events IMO.
s
I agree! I don't have a simple model for events.
It's like the api translation bolted onto a table store, with a magic cleanup routine. Which is a mouthful.
💯 1
a
joins are also really weird when using evented tables
i wouldn't mind having an alternative interface to event data, the problem is identifying what it should look like
given how JOINs essentially add race conditions to evented tables, one way of thinking would be to just never do it and make sure evented rows have everything you could possibly need
at that point, though, it doesn't make much sense to use sqlite to access it
(race conditions as in: joining the user id of an audit event with the users table)
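concretely, something like this (illustrative, with process_events standing in for any audit-backed table): the uid recorded in the event is joined against whatever the users table says now, not what it said when the event fired
```sql
-- Illustration of the race: the uid captured at event time is resolved against
-- the current users table, which may have changed (or reused the uid) since.
SELECT pe.time, pe.path, pe.uid, u.username
FROM process_events AS pe
LEFT JOIN users AS u USING (uid);
```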
s
Thinking aloud…
1. Events feed a table
2. That table has some max-events cleanup
3. Events have a unique id
4. join, select, whatever
5. The magic thing would track the last id, and add an implicit "where id > last" (rough sketch below)
But that's a lot of overhead, and I'm not sure it would be performant.
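Very roughly (a pure sketch; `some_events`, `other_table`, and the join key are placeholders, with `eid` standing in for the unique id):
```sql
-- What the user writes:
SELECT * FROM some_events JOIN other_table USING (key);

-- What the magic layer would actually run, remembering last_eid per query:
SELECT * FROM some_events JOIN other_table USING (key)
WHERE some_events.eid > :last_eid;
```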
a
I think the problem is that you can, for example, add/remove users while events are being generated
if the event itself is not capturing the username, then it's only a guess what is going to happen when joining against the users table
worst case scenario, user IDs get reused and we end up with the same ID mapping to different names
if we talk about data "quality", events rank high because they come straight from the source and hopefully things have been acquired atomically (like Endpoint Security, or the metadata we get in bpf/audit)
joining against anything will lower the quality significantly, and I wondered many times if it actually makes sense to do it
I can see why Uptycs would like to acquire data as-is, sending it directly to the logger