# general
a
Hi, I wonder if I could get some ideas on further troubleshooting a recent EC2 performance alert that seems to have been triggered by osquery (version 4.5.1). The EC2 instance runs a high number of containers, and the alert was a drop in the EBS Burst Balance. I have followed some of the recommended troubleshooting and could determine:
• None of the scheduled queries got denylisted
• There is a significant number of process events on the machine
• The osquery_events table showed that at one point there were up to 1 million process events
• The osquery db size got up to 300 MB
Is there any way we could reduce IOPS through any of the settings for highly loaded machines that generate a lot of process events? The relevant osquery config is attached. Also, is there any way we can get disk statistics on Linux so we can further determine that osquery was not the cause (I think it relates to this open issue https://github.com/osquery/osquery/issues/6288 but wanted to double check). Thanks
And some of the relevant config:
--audit_allow_config=true
--audit_persist=true
--disable_audit=false
--disable_distributed=false
--distributed_interval=10
--distributed_plugin=tls
--events_expiry=1
--events_max=500000
--logger_min_status=1
--logger_plugin=tls
--schedule_splay_percent=10
--disable_watchdog=false
--hash_delay=20
--pack_refresh_interval=3600
--table_delay=20
--watchdog_delay=120
--watchdog_level=1
--watchdog_memory_limit=200
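To make the question concrete, this is the kind of adjustment I was wondering about for the event-related flags above, i.e. shrinking the buffered-event ceiling and keeping event optimization on (values are purely illustrative, not something we have tested):
--events_max=100000
--events_optimize=true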
t
Interesting, are there details in the alert about the length of the burst? Was there high IOPS for 60s, or for an hour, etc.? Has the burst happened more than once? Is there something osquery-related you can correlate with?
(To answer your question, I cannot think of a way to get the relevant iostats right now within osquery, we'd need what that feature request asks for.) If the osquery daemon process is still running you can look at its current numbers and see if they are unreasonable.
/proc/$PID/io
should have the read/write bytes. You won't know what the rate was at the time of the alert. I also do not know how accurate these will be. For example, do they include stats about the audit netlink pipe?
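Something like this, as a rough sketch (the counters are cumulative since the process started, so you would need two samples to estimate a rate):
# pick the oldest osqueryd process (the watcher); its children are the workers
PID=$(pgrep -o osqueryd)
cat /proc/$PID/io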
a
Thanks @theopolis for your answers, appreciate your time here. Regarding the alert, it was due to a lower BurstBalance, which shows the % of the burst allowance remaining; the graph was an inverted-mountain shape that took around 12 hours to get back up to 100%. Other AWS CloudWatch graphs for volume read/write ops look normal and do not correlate with the burst one. We are still trying to work out whether the problem is osquery or just a side effect. We also deploy the osquery daemon with cgroups to limit IOPS, but the logs do not indicate there was an issue there. Sadly osquery has got into a weird state where we get two errors continuously:
watcher.cpp:360] osqueryd worker (194852) stopping: Memory limits exceeded: 213448000
udev.cpp:90] udev monitor returned invalid device: No buffer space available
and now we are not getting any logs out to the tls logger (the events column on the osquery_events table has zero on all rows), and config_valid is 0. Also, the latest osquery_schedule logs we got into the SIEM did not indicate any of the scheduled queries were added to a denylist.
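For what it's worth, my reading of the first error (and I may be off here) is that 213448000 bytes is just over the --watchdog_memory_limit=200 we set above (the flag is in MB), so the watcher keeps killing and restarting the worker; a purely illustrative bump would be something like:
--watchdog_memory_limit=350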
t
That error loop you describe is occurring with 4.5.1?
a
that is correct, we recently deployed the latest version of osquery
t
Would you mind creating a GitHub issue for this, trying to include as much debugging info as you can and, if possible, steps to reproduce? If the config is not valid then no queries would exist in the schedule and thus no logs. But why is the config invalid? <— this is the bug worth debugging IMO
a
thanks @theopolis, so I just noticed that the problem with config_valid being 0 only shows up when running the query from osqueryi, and is not the case when I run the same query from Kolide. I guess this would be expected?
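i.e. running it locally from the host like this (osqueryi starts its own osquery instance, separate from the running osqueryd, so I assume it reports its own state):
# separate shell instance; fields like config_valid describe the shell, not the daemon
osqueryi --line "SELECT config_valid, config_hash FROM osquery_info;"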
@theopolis just to share with the community what the performance issue was: we identified the problem as disk write operations coming out of the systemd journal, since it was capturing the audit events once osquery enabled auditing. We had to mask those logs out of the journal with
systemctl mask --now systemd-journald-audit.socket
and that fixed the issue. Might be worth adding this to the documentation, happy to open a PR with this 🙂
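In case it helps anyone else, a quick sanity check after masking (assuming a systemd version that ships the journald audit socket) would be something like:
# the socket should now report as masked, and journal disk usage should stop growing as fast
systemctl status systemd-journald-audit.socket
journalctl --disk-usage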
n
oh that's interesting
t
That sounds like a more significant problem: (1) osquery is configuring audit, (2) somehow auditd is regaining control of the netlink socket and not reconfiguring based on /etc/audit*, (3) your disk is then taxed with the verbose amount of information audit writes. I think we should update the documentation to recommend people disable the auditd service via systemd.
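Something along these lines for the docs, presumably (exact behavior can vary by distro):
# stop auditd and keep it from starting at boot
systemctl disable --now auditd
# note: on some distros the auditd unit refuses a manual stop via systemctl,
# so 'service auditd stop' may be needed instead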
n
I don't think this requires auditd to be running. systemd-journald collects logging data from various sources, including (from the man pages):
audit records, originating from the kernel audit subsystem
As in, it's recording audit messages from just having kernel auditing (what osquery enables) turned on, not auditd.
per https://osquery.readthedocs.io/en/stable/deployment/process-auditing/:
auditd should not be running when using osquery's process auditing, as it will conflict with osqueryd over access to the audit netlink socket. You should also ensure auditd is not configured to start at boot.
^ the recommendation is already that auditd is disabled
Though I hadn't heard about systemd-journald collecting these audit messages before, so disabling that would be useful to add to docs.
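If you want to see exactly what journald is picking up from the kernel audit subsystem (no auditd involved), something like this should show it, assuming your systemd has the audit socket enabled:
# most recent audit records journald collected directly from the kernel
journalctl _TRANSPORT=audit -n 20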
a
+1 to all your comments @nyanshak. So auditd was not enabled on the box; it is the journal getting the audit logs directly from the kernel, and this was having a significant performance impact, so it would definitely be a great idea to add it as a recommendation in the docs.
t
Are you interested in submitting a PR on GitHub to add the relevant notes/heads-up to the docs?
a
Sure, I've always wanted to contribute back to osquery, so once I get a minute this week I will definitely submit the PR