# general
a
Still investigating changes in the event framework from 4.7.0 to 4.8.0. Seems like `events_max` is now closely tied to `events_expiry`. Previously `events_max` meant the number of events to preserve in the database at cleanup. Now it's a number of batches, and batches are formed by event time. Shouldn't this change be in the 4.7.0 changelog?
s
All changes should be in the CHANGELOG, yes. But they may not be written to clearly convey complex effects. This is somewhat because it’s a complex system. And somewhat because I write the CHANGELOG, and don’t always get it right. And somewhat because the PRs themselves are not always very clear…
t
The two were always closely related. The intention of `events_max` was the maximum number of events per subscriber to keep at any point in time. `events_expiry` controlled how often old events would be removed. Each event has an associated time, and events older than the expiry window are removed. If you do not expire events fast enough then you may hit the max, and at that point the oldest events will be removed so as to not overflow the max.
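If it helps with sanity-checking a config, the effective values of these flags can be read back from the `osquery_flags` table; a minimal sketch (column names per the standard osquery schema, so verify against your version):
```
-- Show the default and effective values of the event retention flags.
SELECT name, default_value, value
FROM osquery_flags
WHERE name IN ('events_max', 'events_expiry');
```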
4.8.0 fixed a regression introduced between 4.6.0 and 4.7.0, meaning the regression showed up as a bug in 4.7.0.
a
I agree with that, but this change has broken some assumptions that we relied on for our installation, so I wanted to discuss it and warn other osquery users: the eventing framework refactor brought a backward-incompatible change that leads to some unexpected side effects if not handled properly.
As an example, here is a piece of our infrastructure. We use osquery to monitor Linux hosts. To ensure that osquery won't exceed the allowed limits on memory and CPU usage, we apply a "soft limit" via the osquery watchdog and a "hard limit" via cgroups. Rarely we get a burst of process_events activity on some of those hosts, which can push the osquery scheduler past the cgroup limit while processing a query, so it gets killed. The old `events_max` limited the number of unprocessed rows in the table, so although osquery lost some events, it kept working and didn't trigger the OOM killer. But with the new `events_max`, 50000 is too large for process events, and eviction of overflowing events doesn't happen. The OOM killer terminates the osquery thread, it restarts, reruns the query, and dies again several times until the query gets denylisted (or the OOM-kill loop continues, if the query is not allowed to be denylisted). For now I decreased the schedule interval on this query and lowered `events_max` so osquery fits into the cgroup limit, but I am still searching for a better solution.
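When tuning this, the `osquery_events` table is useful for spotting which subscribers see the most volume; a quick sketch, assuming the standard schema where (as far as I know) `events` counts events seen per publisher/subscriber since start:
```
-- List event publishers/subscribers with their event counts
-- to identify the noisiest sources on a host.
SELECT name, type, subscriptions, events
FROM osquery_events
ORDER BY events DESC;
```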
BTW, I think the default `events_max` should also be decreased. Here are some calculations from my latest setup:
```
osquery> SELECT
    ...>     count(*) AS event_cnt,
    ...>     count(DISTINCT time) AS batches_cnt,
    ...>     DATETIME(min(time), 'unixepoch') AS min_time,
    ...>     DATETIME(max(time), 'unixepoch') AS max_time
    ...> FROM process_events;
+-----------+-------------+---------------------+---------------------+
| event_cnt | batches_cnt | min_time            | max_time            |
+-----------+-------------+---------------------+---------------------+
| 75551     | 490         | 2021-05-31 21:48:22 | 2021-06-01 14:04:22 |
+-----------+-------------+---------------------+---------------------+
```
50000 batches is far too many for process_events.
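From those numbers, each batch (a distinct `time` value) holds roughly 75551 / 490 ≈ 154 events, so a 50000-batch cap could mean several million buffered events. A follow-up query on the same table computes that average directly:
```
-- Average events per batch, where a batch is the set of rows
-- sharing one 'time' value.
SELECT count(*) * 1.0 / count(DISTINCT time) AS avg_events_per_batch
FROM process_events;
```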
@theopolis if you haven't, please read this message as well. It looks like `cleanup_events` only becomes true occasionally, as it is the remainder of a sum of random values. I may be wrong, but this line confuses me. And I am not sure whether it should be triggered every 256 events or every 256 batches.
Sorry for the errors and typos in my messages 🙂
t
Ping @alessandrogario for clarification about the batching.
(Sorry, meant to ping Alessandro.)
a
I believe the old implementation was splitting batches in half, while the new implementation evicts entire batches.
The splitting was causing all the indexes to be split/rewritten in the database.
a
Sorry, @alessandrogario, but I didn't get what you wanted to say. The old implementation (pre-4.7.0) limited the number of events in the database (and limited the number of rows a query processes in one run). That could be used to make some assumptions and to limit osquery's resource usage by dropping overflowing events. The new `events_max` limits the number of batches, but I don't see how to use it, since the number of events per batch is random. I could gather some statistics and guess a new `events_max` value, but then I have another problem: there is no guarantee that cleanup will ever happen.
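To show how uneven the batches are, a distribution query along these lines can be run against the same table as above (plain SQLite aggregation, nothing osquery-specific):
```
-- Min/avg/max events per batch: a wide spread means a batch-count
-- limit maps to an unpredictable number of buffered events.
SELECT min(cnt) AS min_per_batch,
       avg(cnt) AS avg_per_batch,
       max(cnt) AS max_per_batch
FROM (SELECT count(*) AS cnt FROM process_events GROUP BY time);
```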
t
I agree with @Alexander. I put up the change here: https://github.com/osquery/osquery/pull/7143
🎉 1
❤️ 2