# general
a
Hi, I wonder if I could get some ideas on further troubleshooting a recent EC2 performance alert that seems to have been triggered by osquery (version 4.5.1). The EC2 instance runs a high number of containers, and the alert was a drop in the EBS Burst Balance. I have followed some of the recommended troubleshooting and could determine:
• None of the scheduled queries got denylisted
• There is a significant number of process events on the machine
• The osquery_events table showed that at one point there were up to 1 million process events
• The osquery db size got up to 300 MB
Is there any way we could reduce IOPS through any of the settings for highly loaded machines that generate a lot of process events? The relevant osquery config is attached. Also, is there any way we can get disk statistics on Linux so we can further determine that osquery was not the cause (I think it relates to this open issue https://github.com/osquery/osquery/issues/6288 but wanted to double check). Thanks
And some of the relevant config:
--audit_allow_config=true
--audit_persist=true
--disable_audit=false
--disable_distributed=false
--distributed_interval=10
--distributed_plugin=tls
--events_expiry=1
--events_max=500000
--logger_min_status=1
--logger_plugin=tls
--schedule_splay_percent=10
--disable_watchdog=false
--hash_delay=20
--pack_refresh_interval=3600
--table_delay=20
--watchdog_delay=120
--watchdog_level=1
--watchdog_memory_limit=200
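To make the question concrete, this is the kind of adjustment I was wondering about for the event-related flags above, i.e. shrinking the buffered-event ceiling and keeping event optimization on (values are purely illustrative, not something we have tested):
--events_max=100000
--events_optimize=true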
t
Interesting, are there details in the alert about the length of the burst? Was there high IOPS for 60s, or for an hour, etc.? Has the burst happened more than once? Is there something osquery-related you can correlate with?
(To answer your question, I cannot think of a way to get the relevant iostats right now within osquery, we'd need what that feature request asks for.) If the osquery daemon process is still running you can look at its current numbers and see if they are unreasonable.
/proc/$PID/io
should have the read/write bytes. You won't know what the rate was at the time of the alert. I also do not know how accurate these will be. For example, do they include stats about the audit netlink pipe?
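Something like this, as a rough sketch (the counters are cumulative since the process started, so you would need two samples to estimate a rate):
# pick the oldest osqueryd process (the watcher); its children are the workers
PID=$(pgrep -o osqueryd)
cat /proc/$PID/io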
a
Thanks @theopolis for your answers, appreciate your time here. Regarding the alert, it was due to a lower BurstBalance, which shows the % of the burst allowance remaining; the graph was an inverted-mountain shape that took around 12 hours to get back up to 100%. Other AWS CloudWatch graphs for volume read/write ops look normal and do not correlate with the burst one. We are still trying to work out whether the problem is osquery or just a side effect. We also deploy the osquery daemon with cgroups to limit IOPS, but the logs do not indicate there was an issue there. Sadly osquery has got into a weird state where we get two errors continuously:
watcher.cpp:360] osqueryd worker (194852) stopping: Memory limits exceeded: 213448000
udev.cpp:90] udev monitor returned invalid device: No buffer space available
and now we are not getting any logs out to the tls logger (the events column on the osquery_events table has zero on all rows), and config_valid is 0. Also, the latest osquery_schedule logs we got into the SIEM did not indicate any of the scheduled queries were added to a denylist.
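For what it's worth, my reading of the first error (and I may be off here) is that 213448000 bytes is just over the --watchdog_memory_limit=200 we set above (the flag is in MB), so the watcher keeps killing and restarting the worker; a purely illustrative bump would be something like:
--watchdog_memory_limit=350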
t
That error loop you describe is occurring with 4.5.1?
a
that is correct, we recently deployed the latest version of osquery
t
Would you mind creating a GitHub issue for this, trying to include as much debugging info as you can and, if possible, steps to reproduce? If the config is not valid then no queries would exist in the schedule and thus no logs. But why is the config invalid? <— this is the bug worth debugging IMO
a
thanks @theopolis, so I just noticed that the problem with config_valid being 0 only shows up when running the query from osqueryi, and is not the case when I run the same query from Kolide. I guess this would be expected?
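i.e. running it locally from the host like this (osqueryi starts its own osquery instance, separate from the running osqueryd, so I assume it reports its own state):
# separate shell instance; fields like config_valid describe the shell, not the daemon
osqueryi --line "SELECT config_valid, config_hash FROM osquery_info;"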
@theopolis just to share with the community what the performance issue was: we identified the problem as disk write operations coming out of the systemd journal, since it was capturing the audit events once osquery enabled auditing. We had to mask those logs out of the journal with
systemctl mask --now systemd-journald-audit.socket
and that fixed the issue. Might be worth adding this to the documentation, happy to open a PR with this 🙂
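In case it helps anyone else, a quick sanity check after masking (assuming a systemd version that ships the journald audit socket) would be something like:
# the socket should now report as masked, and journal disk usage should stop growing as fast
systemctl status systemd-journald-audit.socket
journalctl --disk-usage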
n
oh that's interesting
t
That sounds like a more significant problem: (1) osquery is configuring audit, (2) somehow auditd is regaining control of the netlink socket and not reconfiguring based on /etc/audit*, (3) your disk is then taxed with the verbose amount of information audit writes. I think we should update the documentation to recommend people disable the auditd service via systemd.
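Something along these lines for the docs, presumably (exact behavior can vary by distro):
# stop auditd and keep it from starting at boot
systemctl disable --now auditd
# note: on some distros the auditd unit refuses a manual stop via systemctl,
# so 'service auditd stop' may be needed instead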
n
I don't think this requires auditd to be running. systemd-journald collects logging data from various sources, including (from the man pages):
audit records, originating from the kernel audit subsystem
As in, it's recording audit messages from just having kernel auditing (what osquery enables) turned on, not auditd.
per https://osquery.readthedocs.io/en/stable/deployment/process-auditing/:
auditd should not be running when using osquery's process auditing, as it will conflict with osqueryd over access to the audit netlink socket. You should also ensure auditd is not configured to start at boot.
^ the recommendation is already that auditd is disabled
Though I hadn't heard about systemd-journald collecting these audit messages before, so disabling that would be useful to add to docs.
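If you want to see exactly what journald is picking up from the kernel audit subsystem (no auditd involved), something like this should show it, assuming your systemd has the audit socket enabled:
# most recent audit records journald collected directly from the kernel
journalctl _TRANSPORT=audit -n 20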
a
+1 to all your comments @nyanshak. So auditd was not enabled on the box; it is the journal getting the audit logs directly from the kernel, and this was having a significant performance impact, so it would definitely be a great idea to add it as a recommendation in the docs.
t
Are you interested in submitting a PR on GitHub to add the relevant notes/heads-up to the docs?
a
Sure, I've always wanted to contribute back to osquery, so once I get a minute this week I will definitely submit the PR