# general
p
Hey folks 👋 We're seeing a bit of strangeness with Kernel soft locks due to `kauditd` since Kernel `5.4.0-1069.73` (Ubuntu AWS package). Looking at the change set, it appears there was a patch to the Kernel which "penalises user space the same as the kernel when under pressure". The cause seems to be the Kernel change, but we're only hitting it on systems under load with osquery deployed; systems running the same osquery version with Kernels prior to this release do not observe this behaviour. The patch set can be found here: https://www.mail-archive.com/linux-audit@redhat.com/msg15527.html Any suggestions? 😄
n
@Peter, can you please describe exactly which side effect you are observing?
From the code I can see that the sender application can be put to sleep for a long time (audit_backlog_wait_time) if the audit_queue breaches audit_backlog_limit. Basically you have a few parameters to play with here (changing these will have an effect at other locations in the code):
- You can set audit_backlog_limit to 0 to fail the if condition, or set it to some higher value at the cost of more strain on system memory.
- You can set audit_backlog_wait_time to some smaller value so that the process can wake up earlier, but at the cost of losing some audit messages (e.g. syscall-based audit messages).
- You can set audit_rate_limit to some lower value at the cost of dropping audit records.

If osquery can read the audit messages faster it can relieve some pressure off the audit system, but there is a limit to how much that helps.
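For anyone who wants to experiment with those knobs, here's a rough sketch of how they can be adjusted at runtime with auditctl. The values are placeholders, and the `--backlog_wait_time` option is only available in reasonably recent audit userspace builds, so check your version first:

```
# Show the current audit status (backlog, backlog_limit, lost, rate_limit, ...)
auditctl -s

# Raise the backlog limit (bigger queue, more kernel memory);
# 0 disables the limit check entirely
auditctl -b 16384

# Lower the backlog wait time (in ticks) so blocked senders wake up sooner,
# at the cost of potentially losing audit records under pressure
# (requires a newer auditctl that supports this option)
auditctl --backlog_wait_time 1000

# Cap the record rate (records per second); records over the limit are dropped
auditctl -r 100
```

These take effect immediately but don't persist across a reboot.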
p
Hey there, since updating the Kernel to a version which includes the following patches we're seeing CPU soft lockups when the system is loaded. These are Ubuntu / AWS Kernels. There are two potential patches which may be leading to the CPU soft lockups we're seeing on Kernels which include them but not on systems which do not, with no changes to the osquery configuration: https://www.mail-archive.com/linux-audit@redhat.com/msg15532.html https://www.mail-archive.com/linux-audit@redhat.com/msg15466.html The errors seem to indicate the trigger is `kauditd`. Most notably, the lockup seems to happen downstream from a call to `auditd_reset`. It seems like the Kernel patches may have some unexpected side effects which lead to CPU soft lockups on affected Kernels.
Apr 07 04:38:57 ip-10-0-74-131 kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kauditd:22]
Apr 07 04:38:57 ip-10-0-74-131 kernel: Modules linked in: ppdev serio_raw parport_pc parport sch_fq_codel ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper ena
Apr 07 04:38:57 ip-10-0-74-131 kernel: CPU: 1 PID: 22 Comm: kauditd Not tainted 5.4.0-1071-aws #76~18.04.1-Ubuntu
Apr 07 04:38:57 ip-10-0-74-131 kernel: Hardware name: Amazon EC2 t3.small/, BIOS 1.0 10/16/2017
Apr 07 04:38:57 ip-10-0-74-131 kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x15/0x20
Apr 07 04:38:57 ip-10-0-74-131 kernel: Code: e8 b0 a3 65 ff 4c 29 e0 4c 39 f0 76 cf 80 0b 08 eb 8a 90 90 90 0f 1f 44 00 00 55 48 89 e5 e8 66 0c 68 ff 66 90 48 89 f7 57 9d <0f> 1f 44 00 00 5d c3 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 c6 07
Apr 07 04:38:57 ip-10-0-74-131 kernel: RSP: 0018:ffffa385c00cbe08 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
Apr 07 04:38:57 ip-10-0-74-131 kernel: RAX: 0000000000000001 RBX: ffffffff96fd3b70 RCX: 0000000000000769
Apr 07 04:38:57 ip-10-0-74-131 kernel: RDX: 000000000000076a RSI: 0000000000000246 RDI: 0000000000000246
Apr 07 04:38:57 ip-10-0-74-131 kernel: RBP: ffffa385c00cbe08 R08: 0000000000000000 R09: ffffffff958b4801
Apr 07 04:38:57 ip-10-0-74-131 kernel: R10: ffffa385c00cbd98 R11: 0000000000000000 R12: ffff907db699ff00
Apr 07 04:38:57 ip-10-0-74-131 kernel: R13: ffffffff96fd3b84 R14: ffff907db699ffd4 R15: ffff907db699ff00
Apr 07 04:38:57 ip-10-0-74-131 kernel: FS:  0000000000000000(0000) GS:ffff907dfb500000(0000) knlGS:0000000000000000
Apr 07 04:38:57 ip-10-0-74-131 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 07 04:38:57 ip-10-0-74-131 kernel: CR2: 000055c63d5d9ee0 CR3: 0000000008a0a006 CR4: 00000000007606e0
Apr 07 04:38:57 ip-10-0-74-131 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 07 04:38:57 ip-10-0-74-131 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Apr 07 04:38:57 ip-10-0-74-131 kernel: PKRU: 55555554
Apr 07 04:38:57 ip-10-0-74-131 kernel: Call Trace:
Apr 07 04:38:57 ip-10-0-74-131 kernel:  skb_queue_head+0x47/0x50
Apr 07 04:38:57 ip-10-0-74-131 kernel:  kauditd_rehold_skb+0x18/0x20
Apr 07 04:38:57 ip-10-0-74-131 kernel:  kauditd_send_queue+0xb6/0xf0
Apr 07 04:38:57 ip-10-0-74-131 kernel:  ? kauditd_retry_skb+0x20/0x20
Apr 07 04:38:57 ip-10-0-74-131 kernel:  kauditd_thread+0xa0/0x250
Apr 07 04:38:57 ip-10-0-74-131 kernel:  ? __wake_up_pollfree+0x40/0x40
Apr 07 04:38:57 ip-10-0-74-131 kernel:  kthread+0x121/0x140
Apr 07 04:38:57 ip-10-0-74-131 kernel:  ? auditd_reset+0xa0/0xa0
Apr 07 04:39:35 ip-10-0-74-131 kernel: audit_log_start: 1773 callbacks suppressed
Apr 07 04:39:35 ip-10-0-74-131 kernel: audit: audit_backlog=4098 > audit_backlog_limit=4096
Apr 07 04:39:35 ip-10-0-74-131 kernel: audit: audit_lost=27 audit_rate_limit=0 audit_backlog_limit=4096
Apr 07 04:39:51 ip-10-0-74-131 kernel: audit: audit_backlog=4098 > audit_backlog_limit=4096
Apr 07 04:39:51 ip-10-0-74-131 kernel: audit: audit_lost=28 audit_rate_limit=0 audit_backlog_limit=4096
It looks like both Red Hat and Canonical / Ubuntu are investigating: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1965723
A colleague of mine got a reliable repro of this issue yesterday evening and attempted to mitigate it by tuning these flags: `audit_backlog_wait_time` and `audit_backlog_limit`. Unfortunately, that did not appear to prevent the issue from occurring.
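For anyone else wanting to try the same mitigation, this is roughly the general approach: the backlog limit can be bumped persistently via the kernel command line, and the wait time at runtime via auditctl. A sketch of that (values illustrative, and not necessarily exactly what we ran):

```
# In /etc/default/grub, add the audit parameters to the kernel command line
# (append to whatever is already there; the value is illustrative)
GRUB_CMDLINE_LINUX="audit=1 audit_backlog_limit=16384"

# Regenerate the GRUB config and reboot for it to take effect
sudo update-grub

# After the reboot, confirm the kernel picked the parameters up
cat /proc/cmdline
sudo auditctl -s   # backlog_limit should show the new value

# audit_backlog_wait_time has no boot parameter that I'm aware of,
# so it still has to be adjusted at runtime:
sudo auditctl --backlog_wait_time 1000
```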
The launchpad post above references the following commit as the fix for the issue, but it doesn't appear to have resolved things for us, as that commit is also present in the affected Kernel: `audit: improve audit queue handling when "audit=1" on cmdline`
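If anyone wants to double-check whether a given Ubuntu kernel build actually carries that fix, the package changelog normally lists the cherry-picked upstream commit subjects, so something along these lines should work (assuming apt can fetch the changelog for the running kernel package):

```
# Fetch the changelog for the running kernel package and search it
# for the commit subject referenced above
apt-get changelog "linux-image-$(uname -r)" | grep -i 'improve audit queue handling'
```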
It may just be a new "tipping point" issue introduced by the addition of the "penalty" code, which causes systems with only two cores and high load to become overwhelmed. That said, given the CPU lockup appears to be downstream from the call to `auditd_reset`, it's also possible that there's an additional edge case not considered by these patches which is causing these stability issues. I'd expect the audit queue to be flushed and data lost, rather than a CPU being soft locked.
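For what it's worth, if anyone else wants to poke at the "tipping point" theory, a crude way to pile pressure onto the audit queue is to temporarily add a catch-all syscall rule and run something syscall-heavy alongside it. This is just an illustrative sketch, not the actual repro my colleague used, and definitely not something to run on a box you care about:

```
# Audit every syscall (extremely heavy; lab machines only)
sudo auditctl -a always,exit -F arch=b64 -S all -k audit_pressure_test

# Generate syscall-heavy load to start filling the audit backlog
for i in 1 2 3 4; do
  (while true; do stat /etc/hostname > /dev/null; done) &
done

# Watch the backlog / lost counters climb while the load runs
sudo auditctl -s

# Clean up: stop the load and remove the temporary rule
kill $(jobs -p)
sudo auditctl -d always,exit -F arch=b64 -S all -k audit_pressure_test
```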