# general
p
Hey again 👋 Just a note on the CPU soft lockup issue I mentioned above. It appears that two "new" linux-audit patches were bundled into Ubuntu's 5.4 LTS Kernel:
• [RFC PATCH] audit: improve audit queue handling when "audit=1" on cmdline
• [PATCH] audit: ensure userspace is penalized the same as the kernel when under pressure
These first made it into the AWS, Azure, GKE, and GCP Ubuntu Kernels with versions greater than 5.4.0-1068. Unfortunately, this appears to have a detrimental impact on systems running osquery - or with audit rules installed which generate quite a lot of events. Under load, we see kernels above this version reliably killing systems entirely due to CPU soft lockups.
kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kauditd:22]
Downgrading to Kernel `5.4.0.1068.50` and rebooting the system with no other changes resolves the issue. Subsequent load spikes do not trigger the issue. However, if we roll the Kernel forward to an "affected" version, we once again see the soft lockups.
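If anyone wants to check whether a host is on an affected build, something like this works (the package names assume the Ubuntu AWS kernel flavour; adjust for Azure/GKE/GCP):

```bash
# Show the running kernel and the installed AWS-flavour kernel images.
uname -r
apt list --installed 2>/dev/null | grep 'linux-image.*aws'

# Optionally hold the kernel meta-packages after downgrading, so unattended
# upgrades don't roll the kernel forward again.
sudo apt-mark hold linux-aws linux-image-aws
```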
The osquery managed audit rules are as follows:
```
-a always,exit -S connect,accept,bind,execve,accept4,execveat
```
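(To confirm what's actually loaded on a host - assuming the audit userspace tools are installed - auditctl shows both the rules and the backlog counters that appear in the logs below:)

```bash
sudo auditctl -l   # list loaded rules; should include the osquery-managed rule above
sudo auditctl -s   # kernel audit status: backlog, backlog_limit, lost counters, etc.
```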
New printk'd messages prior to the lockup:
```
Apr 07 04:39:35 ip-10-0-74-131 kernel: audit_log_start: 1773 callbacks suppressed
Apr 07 04:39:35 ip-10-0-74-131 kernel: audit: audit_backlog=4098 > audit_backlog_limit=4096
Apr 07 04:39:35 ip-10-0-74-131 kernel: audit: audit_lost=27 audit_rate_limit=0 audit_backlog_limit=4096
Apr 07 04:39:51 ip-10-0-74-131 kernel: audit: audit_backlog=4098 > audit_backlog_limit=4096
Apr 07 04:39:51 ip-10-0-74-131 kernel: audit: audit_lost=28 audit_rate_limit=0 audit_backlog_limit=4096
```
Immediately followed by a trace:
```
Apr 07 04:38:57 ip-10-0-74-131 kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kauditd:22]
Apr 07 04:38:57 ip-10-0-74-131 kernel: Modules linked in: ppdev serio_raw parport_pc parport sch_fq_codel ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper ena
Apr 07 04:38:57 ip-10-0-74-131 kernel: CPU: 1 PID: 22 Comm: kauditd Not tainted 5.4.0-1071-aws #76~18.04.1-Ubuntu
Apr 07 04:38:57 ip-10-0-74-131 kernel: Hardware name: Amazon EC2 t3.small/, BIOS 1.0 10/16/2017
Apr 07 04:38:57 ip-10-0-74-131 kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x15/0x20
Apr 07 04:38:57 ip-10-0-74-131 kernel: Code: e8 b0 a3 65 ff 4c 29 e0 4c 39 f0 76 cf 80 0b 08 eb 8a 90 90 90 0f 1f 44 00 00 55 48 89 e5 e8 66 0c 68 ff 66 90 48 89 f7 57 9d <0f> 1f 44 00 00 5d c3 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 c6 07
Apr 07 04:38:57 ip-10-0-74-131 kernel: RSP: 0018:ffffa385c00cbe08 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
Apr 07 04:38:57 ip-10-0-74-131 kernel: RAX: 0000000000000001 RBX: ffffffff96fd3b70 RCX: 0000000000000769
Apr 07 04:38:57 ip-10-0-74-131 kernel: RDX: 000000000000076a RSI: 0000000000000246 RDI: 0000000000000246
Apr 07 04:38:57 ip-10-0-74-131 kernel: RBP: ffffa385c00cbe08 R08: 0000000000000000 R09: ffffffff958b4801
Apr 07 04:38:57 ip-10-0-74-131 kernel: R10: ffffa385c00cbd98 R11: 0000000000000000 R12: ffff907db699ff00
Apr 07 04:38:57 ip-10-0-74-131 kernel: R13: ffffffff96fd3b84 R14: ffff907db699ffd4 R15: ffff907db699ff00
Apr 07 04:38:57 ip-10-0-74-131 kernel: FS:  0000000000000000(0000) GS:ffff907dfb500000(0000) knlGS:0000000000000000
Apr 07 04:38:57 ip-10-0-74-131 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 07 04:38:57 ip-10-0-74-131 kernel: CR2: 000055c63d5d9ee0 CR3: 0000000008a0a006 CR4: 00000000007606e0
Apr 07 04:38:57 ip-10-0-74-131 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 07 04:38:57 ip-10-0-74-131 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Apr 07 04:38:57 ip-10-0-74-131 kernel: PKRU: 55555554
Apr 07 04:38:57 ip-10-0-74-131 kernel: Call Trace:
Apr 07 04:38:57 ip-10-0-74-131 kernel:  skb_queue_head+0x47/0x50
Apr 07 04:38:57 ip-10-0-74-131 kernel:  kauditd_rehold_skb+0x18/0x20
Apr 07 04:38:57 ip-10-0-74-131 kernel:  kauditd_send_queue+0xb6/0xf0
Apr 07 04:38:57 ip-10-0-74-131 kernel:  ? kauditd_retry_skb+0x20/0x20
Apr 07 04:38:57 ip-10-0-74-131 kernel:  kauditd_thread+0xa0/0x250
Apr 07 04:38:57 ip-10-0-74-131 kernel:  ? __wake_up_pollfree+0x40/0x40
Apr 07 04:38:57 ip-10-0-74-131 kernel:  kthread+0x121/0x140
Apr 07 04:38:57 ip-10-0-74-131 kernel:  ? auditd_reset+0xa0/0xa0
```
I can provide an easy script to generate a repro for anyone who would be interested in helping me validate (a rough sketch is below). I've seen postings on the Ubuntu Launchpad tracker and Stack Overflow which seem very similar, and which indicate that this issue also occurs in Ubuntu 20.04, as well as in the associated AWS, Azure, and "generic" Kernels. It feels to me like it's one of two things:
1. One of the patches has an unexpected condition which is leading to this behaviour on loaded systems.
2. The new "penalty" code makes it impossible for osquery to consume audit data on a loaded system in a timely manner - though this doesn't seem to explain the lockups directly.
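The sketch - it's not the exact script, and the worker/iteration counts are arbitrary - is just something that hammers the audited syscalls:

```bash
#!/usr/bin/env bash
# Hammer the audited syscalls (execve, plus possible connect/bind via name
# lookups) from several workers so kauditd's backlog fills up under load.
NPROC=${NPROC:-8}     # number of parallel workers (arbitrary)
ITER=${ITER:-5000}    # iterations per worker (arbitrary)

for _ in $(seq "$NPROC"); do
  (
    for _ in $(seq "$ITER"); do
      /bin/true                                  # cheap execve event
      getent hosts localhost >/dev/null 2>&1     # another execve, plus possible socket syscalls
    done
  ) &
done
wait
```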
s
@Peter could you try passing to osquery `--audit_backlog_wait_time=60000` and see if this improves anything?
p
Sure can, just a mo.
Unfortunately not. Same issue, CPU soft lockup under load.
s
I see; I only gave this a quick look, but it surely is confusing. The first patch you linked seems to have been made specifically to fix an issue with soft lockups: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1965723 So the combination of the two somehow made it worse? Also I’m curious, what’s the osquery CPU usage in all this?
Also, does putting the wait time at 60k make the lockup happen sooner, later, or at the same time?
p
Yeah, that's what has us a bit perplexed. The Kernel we're using definitely has the "fix" patch, but we can reliably reproduce the issue under load. It definitely seems to take longer before the soft lockups are seen with this wait time set. Just trying to get the load information for you now, which is a bit challenging as the system stops responding to terminal input entirely so we may only get information from just before the system locks up.
s
actually, now that I remember, have you verified that osquery is not getting killed by the watchdog?
because as far as I recall, when there’s high activity like that, the default limit of 200MB of RAM won’t cut it
p
Let me check around the watchdog. Thanks for the assistance, by the way! Here's our full flags file - note we already increase the memory limit slightly:
```
--audit_allow_config=true
--audit_allow_sockets=true
--audit_persist=true
--audit_allow_accept_socket_events=true
--audit_allow_null_accept_socket_events=true
--disable_audit=false
--disable_events=false
--enable_file_events=true
--events_expiry=1
--events_max=50000
--audit_backlog_wait_time=60000
--events_optimize=true
--host_identifier=uuid
--pack_delimiter=_
--verbose=false
--schedule_splay_percent=10
--logger_snapshot_event_type=true
--watchdog_memory_limit=350
--watchdog_utilization_limit=30
```
s
So, I would check if osquery is getting killed. A quicker way to see that is to launch osquery in the shell, so that the watchdog process is able to write the message about CPU or memory limits being hit (otherwise the message gets lost, because the watchdog doesn’t implement the logger plugins that would permit it to write them to file or via TLS)
I understand that the behavior of osquery and the watchdog would be the same pre and post patch, but it’s worth ruling out - this might be amplifying the issue you’re seeing post patch
p
Gotcha, let me give that a try. In the meantime, here's the last tmux frame before the system became entirely unresponsive. It's interesting that pid 930 of osqueryd is in an uninterruptible sleep. That said, the CPU and memory seem okay?
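(For anyone else digging in, the same process-state information can be grabbed non-interactively with something like:)

```bash
# STAT "D" = uninterruptible sleep; WCHAN is the kernel function the task
# is currently blocked in.
ps -eo pid,stat,wchan:32,pcpu,pmem,comm | grep -E 'osqueryd|kauditd'
```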
My knowledge of the kauditd internals isn't great, but I wonder whether osqueryd is disconnecting from the audit socket - perhaps due to watchdog termination - which is triggering unexpected behaviour downstream from the call to `auditd_reset`. Due to the new "penalty" code, I wonder if the call to `auditd_reset` after this disconnection is being blocked in the kthread unexpectedly - as if it were the client that had been penalised?
That's strange. If I run osqueryd in the foreground, using the same flags - with the addition of `--verbose` - the issue doesn't present itself 🤔
```
osqueryd --verbose --flagfile /etc/osquery/osquery.flags --config_path /etc/osquery/osquery.conf
```
I just ^C'd the `osqueryd` running in the foreground - which didn't have any issues - then started the service via systemd, ran the reproduction script, and it started soft locking up again.
s
I was going to say that there’s a limit in systemd
p
I just disabled the CPUQuota line and, running under systemd, it no longer locks up.
I have a strong feeling this is something wonky with the penalty code when the client disconnects from the audit socket.
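For reference, one way to clear it without editing the packaged unit - assuming the service name is osqueryd - is a systemd drop-in:

```bash
sudo systemctl edit osqueryd
# In the override file that opens, an empty assignment should clear any
# CPUQuota inherited from the packaged unit:
#   [Service]
#   CPUQuota=
sudo systemctl restart osqueryd
```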
s
Though I don’t understand why osquery should disconnect. Now the problem is understanding whether osquery was getting killed due to too much memory usage. I have a PR that’s been open for a while which addresses the issue of osquery not having enough CPU - and especially of the audit socket reading thread going faster than the other processing threads can keep up
and an internal buffer growing “indefinitely”, causing the memory limit to be hit
p
Interesting, yeah, that's a good question. Without that CPUQuota setting the reproduction script completes and osquery keeps running. Let me run it under sar to see what happens with memory utilisation.
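Roughly this, sampling once a second while the repro runs (sar and pidstat both come from the sysstat package):

```bash
sar -r 1 > sar-mem.log &                     # system-wide memory utilisation
sar -u 1 > sar-cpu.log &                     # system-wide CPU utilisation
pidstat -urd -C osqueryd 1 > pidstat.log &   # per-process CPU/memory/IO for osqueryd
```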
s
I’m probably not seeing some interaction in the penalty code, but all it seemed to me to be was a sleep in audit_receive - the wait time for which osquery sets to 0 by default, btw - so I would’ve expected that change to do nothing.
p
The semantics behind why osquery might be disconnected from the socket are still a mystery - if that is indeed occurring. That said, in that case, what happens if `auditd_reset` is called while the wait queue is "active"? https://elixir.bootlin.com/linux/v5.4.189/source/kernel/audit.c#L1559
Thanks again for the assistance here. I'm going to go do some more reading on the topic, and have a think about where I can get a debugger attached to get better visibility into what's occurring. Be back in a few.
Ah I see what you mean around `audit_backlog_wait_time`. Well now I'm just even more confused :')
Looping back on this one: it appears that the AWS 1072.77 Kernel resolves this issue - it is no longer triggerable on this Kernel. According to the following diff, the 1072.77 Kernel includes some changes to `audit.c`. What's weird is that the diff for linux-aws 5.4.0-1068.72 to 5.4.0-1069.73 includes these same `audit.c` patches, yet I can't reproduce the issue on the new Kernel, whereas I can reproduce it reliably on the older ones. 🤔
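For anyone wanting to compare what actually went into these builds, the package changelogs can be pulled with apt (the version strings below are just the ones mentioned above):

```bash
apt changelog linux-image-5.4.0-1069-aws > 1069.changelog
apt changelog linux-image-5.4.0-1072-aws > 1072.changelog
diff 1069.changelog 1072.changelog
```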