Peter
04/20/2022, 10:28 AMkernel: watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kauditd:22]Downgrading to Kernel
5.4.0.1068.50
and rebooting the system with no other changes resolves the issue. Subsequent load spikes to not experience this issue. However, if we roll the Kernel forward to an "affected" version, we once again see the soft lockups.-a always,exit -S connect,accept,bind,execve,accept4,execveat
New printk'd messages prior to the lockup:
Apr 07 04:39:35 ip-10-0-74-131 kernel: audit_log_start: 1773 callbacks suppressed
Apr 07 04:39:35 ip-10-0-74-131 kernel: audit: audit_backlog=4098 > audit_backlog_limit=4096
Apr 07 04:39:35 ip-10-0-74-131 kernel: audit: audit_lost=27 audit_rate_limit=0 audit_backlog_limit=4096
Apr 07 04:39:51 ip-10-0-74-131 kernel: audit: audit_backlog=4098 > audit_backlog_limit=4096
Apr 07 04:39:51 ip-10-0-74-131 kernel: audit: audit_lost=28 audit_rate_limit=0 audit_backlog_limit=4096
Apr 07 04:38:57 ip-10-0-74-131 kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kauditd:22]
Apr 07 04:38:57 ip-10-0-74-131 kernel: Modules linked in: ppdev serio_raw parport_pc parport sch_fq_codel ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper ena
Apr 07 04:38:57 ip-10-0-74-131 kernel: CPU: 1 PID: 22 Comm: kauditd Not tainted 5.4.0-1071-aws #76~18.04.1-Ubuntu
Apr 07 04:38:57 ip-10-0-74-131 kernel: Hardware name: Amazon EC2 t3.small/, BIOS 1.0 10/16/2017
Apr 07 04:38:57 ip-10-0-74-131 kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x15/0x20
Apr 07 04:38:57 ip-10-0-74-131 kernel: Code: e8 b0 a3 65 ff 4c 29 e0 4c 39 f0 76 cf 80 0b 08 eb 8a 90 90 90 0f 1f 44 00 00 55 48 89 e5 e8 66 0c 68 ff 66 90 48 89 f7 57 9d <0f> 1f 44 00 00 5d c3 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 c6 07
Apr 07 04:38:57 ip-10-0-74-131 kernel: RSP: 0018:ffffa385c00cbe08 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
Apr 07 04:38:57 ip-10-0-74-131 kernel: RAX: 0000000000000001 RBX: ffffffff96fd3b70 RCX: 0000000000000769
Apr 07 04:38:57 ip-10-0-74-131 kernel: RDX: 000000000000076a RSI: 0000000000000246 RDI: 0000000000000246
Apr 07 04:38:57 ip-10-0-74-131 kernel: RBP: ffffa385c00cbe08 R08: 0000000000000000 R09: ffffffff958b4801
Apr 07 04:38:57 ip-10-0-74-131 kernel: R10: ffffa385c00cbd98 R11: 0000000000000000 R12: ffff907db699ff00
Apr 07 04:38:57 ip-10-0-74-131 kernel: R13: ffffffff96fd3b84 R14: ffff907db699ffd4 R15: ffff907db699ff00
Apr 07 04:38:57 ip-10-0-74-131 kernel: FS: 0000000000000000(0000) GS:ffff907dfb500000(0000) knlGS:0000000000000000
Apr 07 04:38:57 ip-10-0-74-131 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 07 04:38:57 ip-10-0-74-131 kernel: CR2: 000055c63d5d9ee0 CR3: 0000000008a0a006 CR4: 00000000007606e0
Apr 07 04:38:57 ip-10-0-74-131 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 07 04:38:57 ip-10-0-74-131 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Apr 07 04:38:57 ip-10-0-74-131 kernel: PKRU: 55555554
Apr 07 04:38:57 ip-10-0-74-131 kernel: Call Trace:
Apr 07 04:38:57 ip-10-0-74-131 kernel: skb_queue_head+0x47/0x50
Apr 07 04:38:57 ip-10-0-74-131 kernel: kauditd_rehold_skb+0x18/0x20
Apr 07 04:38:57 ip-10-0-74-131 kernel: kauditd_send_queue+0xb6/0xf0
Apr 07 04:38:57 ip-10-0-74-131 kernel: ? kauditd_retry_skb+0x20/0x20
Apr 07 04:38:57 ip-10-0-74-131 kernel: kauditd_thread+0xa0/0x250
Apr 07 04:38:57 ip-10-0-74-131 kernel: ? __wake_up_pollfree+0x40/0x40
Apr 07 04:38:57 ip-10-0-74-131 kernel: kthread+0x121/0x140
Apr 07 04:38:57 ip-10-0-74-131 kernel: ? auditd_reset+0xa0/0xa0
Stefano Bonicatti
04/20/2022, 12:16 PM--audit_backlog_wait_time=60000
and see if this improves anything?Peter
04/20/2022, 12:17 PMStefano Bonicatti
04/20/2022, 12:34 PMPeter
04/20/2022, 12:40 PMStefano Bonicatti
04/20/2022, 12:41 PMPeter
04/20/2022, 12:43 PM--audit_allow_config=true
--audit_allow_sockets=true
--audit_persist=true
--audit_allow_accept_socket_events=true
--audit_allow_null_accept_socket_events=true
--disable_audit=false
--disable_events=false
--enable_file_events=true
--events_expiry=1
--events_max=50000
--audit_backlog_wait_time=60000
--events_optimize=true
--host_identifier=uuid
--pack_delimiter=_
--verbose=false
--schedule_splay_percent=10
--logger_snapshot_event_type=true
--watchdog_memory_limit=350
--watchdog_utilization_limit=30
Stefano Bonicatti
04/20/2022, 12:45 PMPeter
04/20/2022, 12:47 PMauditd_reset
.
Due to the the new "penalty" code, I wonder if that's causing the call to auditd_reset
after this disconnection to be blocked in the kthread unexpectedly - as if it was the client that had been penalised?--verbose
the issue doesn't present itself 🤔
osqueryd --verbose --flagfile /etc/osquery/osquery.flags --config_path /etc/osquery/osquery.conf
osqueryd
running in the foreground which didn't have any issues, and started the service - via systemd - and ran the reproduction script, and it started soft locking up again.Stefano Bonicatti
04/20/2022, 1:09 PMPeter
04/20/2022, 1:09 PMStefano Bonicatti
04/20/2022, 1:11 PMPeter
04/20/2022, 1:15 PMStefano Bonicatti
04/20/2022, 1:16 PMPeter
04/20/2022, 1:22 PMauditd_reset
is called while the wait queue is "active"?
https://elixir.bootlin.com/linux/v5.4.189/source/kernel/audit.c#L1559audit_backlog_wait_time
. Well now I'm just even more confused :')audit.c
.
What's weird is that the diff for linux-aws 5.4.0-1068.72 to 5.4.0-1069.73 includes these same audit.c
patches, but I can't reproduce the issue on the new Kernel, whereas I can reliably on the older ones. 🤔