# general
a
Still investigating changes in the event framework from 4.7.0 to 4.8.0. Seems like `events_max` is now closely tied to `events_expiry`. Previously `events_max` meant the number of events to preserve in the database at cleanup. Now it's a number of batches, and batches are formed by event time. Shouldn't this change be in the 4.7.0 changelog?
s
All changes should be in the CHANGELOG, yes. But they may not be written to clearly convey complex effects. This is somewhat because it’s a complex system. And somewhat because I write the CHANGELOG, and don’t always get it right. And somewhat because the PRs themselves are not always very clear…
t
The two were always closely related. The intention of `events_max` was the maximum number of events per subscriber to keep at any point in time. `events_expiry` controlled how often old events would be removed. Each event has an associated time, and events older than the expiry window are removed. If you do not expire events fast enough then you may hit the max, and at that point the oldest events will be removed so as to not overflow the max.
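If it helps with sanity-checking a config, the effective values of these flags can be read back from the `osquery_flags` table; a minimal sketch (column names per the standard osquery schema, so verify against your version):
```
-- Show the default and effective values of the event retention flags.
SELECT name, default_value, value
FROM osquery_flags
WHERE name IN ('events_max', 'events_expiry');
```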
4.8.0 fixed a regression introduced between 4.6.0 and 4.7.0, meaning the regression showed up as a bug in 4.7.0.
a
I agree with that, but this change has broken some assumptions that we relied on for our installation, so I wanted to discuss it and warn other osquery users: the eventing framework refactor brought a backward-incompatible change that leads to some unexpected side effects if not handled properly.
As an example, here is a piece of our infrastructure. We use osquery to monitor Linux hosts. To ensure that osquery won't exceed the allowed limits on memory and CPU usage, we apply a "soft limit" via the osquery watchdog and a "hard limit" via cgroups. Rarely we get a burst of process_events activity on some of those hosts, which can push the osquery scheduler past the cgroup limit while processing a query, so it gets killed. The old `events_max` limited the number of unprocessed rows in the table, so although osquery lost some events, it kept working and didn't trigger the OOM killer. But with the new `events_max`, 50000 is too large for process events, and eviction of overflowing events doesn't happen. The OOM killer terminates the osquery thread, it restarts, reruns the query, and dies again several times until the query gets denylisted (or the OOM-kill loop continues, if the query is not allowed to be denylisted). For now I decreased the schedule interval on this query and lowered `events_max` so osquery fits into the cgroup limit, but I am still searching for a better solution.
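When tuning this, the `osquery_events` table is useful for spotting which subscribers see the most volume; a quick sketch, assuming the standard schema where (as far as I know) `events` counts events seen per publisher/subscriber since start:
```
-- List event publishers/subscribers with their event counts
-- to identify the noisiest sources on a host.
SELECT name, type, subscriptions, events
FROM osquery_events
ORDER BY events DESC;
```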
BTW, I think the default `events_max` should also be decreased. Here are some calculations from my latest setup:
```
osquery> SELECT
    ...>     count(*) AS event_cnt,
    ...>     count(DISTINCT time) AS batches_cnt,
    ...>     DATETIME(min(time), 'unixepoch') AS min_time,
    ...>     DATETIME(max(time), 'unixepoch') AS max_time
    ...> FROM process_events;
+-----------+-------------+---------------------+---------------------+
| event_cnt | batches_cnt | min_time            | max_time            |
+-----------+-------------+---------------------+---------------------+
| 75551     | 490         | 2021-05-31 21:48:22 | 2021-06-01 14:04:22 |
+-----------+-------------+---------------------+---------------------+
```
50000 batches is far too many for process_events.
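From those numbers, each batch (a distinct `time` value) holds roughly 75551 / 490 ≈ 154 events, so a 50000-batch cap could mean several million buffered events. A follow-up query on the same table computes that average directly:
```
-- Average events per batch, where a batch is the set of rows
-- sharing one 'time' value.
SELECT count(*) * 1.0 / count(DISTINCT time) AS avg_events_per_batch
FROM process_events;
```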
@theopolis if you haven't, please read this message as well. It looks like `cleanup_events` only becomes true occasionally, as it is the remainder of a sum of random values. I may be wrong, but this line confuses me. And I am not sure whether it should be triggered every 256 events or every 256 batches.
Sorry for the errors and typos in my messages 🙂
t
Ping @alessandrogario for clarification about the batching.
(Sorry, meant to ping Alessandro.)
a
I believe the old implementation was splitting batches in half, while the new implementation evicts entire batches.
The splitting was causing all the indexes to be split/rewritten in the database.
a
Sorry, @alessandrogario, but I didn't get what you wanted to say. The old implementation (pre-4.7.0) limited the number of events in the database (and limited the number of rows a query processes in one run). That could be used to make some assumptions and to limit osquery's resource usage by dropping overflowing events. The new `events_max` limits the number of batches, but I don't see how to use it, since the number of events per batch is random. I could gather some statistics and guess a new `events_max` value, but then I have another problem: there is no guarantee that cleanup will ever happen.
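To show how uneven the batches are, a distribution query along these lines can be run against the same table as above (plain SQLite aggregation, nothing osquery-specific):
```
-- Min/avg/max events per batch: a wide spread means a batch-count
-- limit maps to an unpredictable number of buffered events.
SELECT min(cnt) AS min_per_batch,
       avg(cnt) AS avg_per_batch,
       max(cnt) AS max_per_batch
FROM (SELECT count(*) AS cnt FROM process_events GROUP BY time);
```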
t
I agree with @Alexander. I put up the change here: https://github.com/osquery/osquery/pull/7143
🎉 1
❤️ 2