Title
#general
n

npamnani

10/24/2019, 4:43 AM
@timb, please provide more info
t

timb

10/24/2019, 4:51 AM
after enabling force_reconfigure, a few hosts would immediately spewing audit_log_start: callback supressed and audit_lost messages, followed immediately by
[  984.078179] Call Trace:
[  984.080639]  __schedule+0x2bc/0x89b
[  984.084134]  schedule+0x36/0x7c
[  984.087280]  schedule_preempt_disabled+0xe/0x10
[  984.091812]  __mutex_lock.isra.5+0x20c/0x634
[  984.096089]  ? release_sock+0x8f/0x94
4:52 AM
which leads pretty quick to...
[ 1106.935142] INFO: task systemd:1 blocked for more than 120 seconds.
n

npamnani

10/24/2019, 7:42 AM
1. what other audit flags you are using? 2. Can you provide output of 'auditctl -l'? 3. Can you provide more log message around this line but pertaining to audit? 4. Is system crashing or it becomes non-responsive?
t

timb

10/24/2019, 5:15 PM
these should be the same flags that were in place, beyond audit_force_reconfigure:
[
  {"name":"audit_allow_config","value":"true"},
  {"name":"audit_allow_fim_events","value":"false"},
  {"name":"audit_allow_fork_process_events","value":"false"},
  {"name":"audit_allow_process_events","value":"true"},
  {"name":"audit_allow_selinux_events","value":"false"},
  {"name":"audit_allow_sockets","value":"true"},
  {"name":"audit_allow_unix","value":"false"},
  {"name":"audit_allow_user_events","value":"true"},
  {"name":"audit_debug","value":"false"},
  {"name":"audit_fim_debug","value":"false"},
  {"name":"audit_fim_show_accesses","value":"false"},
  {"name":"audit_force_reconfigure","value":"false"},
  {"name":"audit_force_unconfigure","value":"false"},
  {"name":"audit_persist","value":"true"},
  {"name":"audit_show_partial_fim_events","value":"false"},
  {"name":"audit_show_untracked_res_warnings","value":"false"},
  {"name":"disable_audit","value":"false"}
]
5:16 PM
do you want auditctl w/ or w/o force_reconfigure enabled? and becoming non-responsive; systemd and sshd block, PAM blocks so you can’t log in, but it still responds to ping, etc
5:16 PM
load pretty much infinitely spikes
5:17 PM
i didn’t see any log messages specific to audit other than audit_lost type messages; i can do some looking and scrub them a bit soon if you think it’d be helpful
5:18 PM
this is osquery 4.0.2 on newer 4x kernels; audit via osquery has been working fine on some of these for >1yr until the force_reconfigure flag was added
5:27 PM
with and without force_reconfigure the only difference appears to be the PID
$ sudo auditctl -s
enabled 1
failure 0
pid 26178
rate_limit 0
backlog_limit 4096
lost 0
backlog 0
loginuid_immutable 0 unlocked
5:48 PM
after that is just a sea of audit_lost and audit_backlog messages. let me know if there’s anything else that would be useful
n

npamnani

10/24/2019, 6:09 PM
@timb, 'auditctl -l' with force reconfigure
theopolis

theopolis

10/24/2019, 6:40 PM
^ also, what does that look like before and after starting osquery.
t

timb

10/24/2019, 7:03 PM
@npamnani auditctl -l looks the same with and without force reconfigure
7:04 PM
ah, -l, sorry! i gave you -s, one moment
7:07 PM
osquery running w/o force_reconfigure:
$ sudo auditctl -l;echo; sudo auditctl -s
-a always,exit -S connect
-a always,exit -S bind
-a always,exit -S execve
-a always,exit -S execveat

enabled 1
failure 0
pid 19405
rate_limit 0
backlog_limit 4096
lost 0
backlog 0
loginuid_immutable 0 unlocked
osquery stopped:
$ sudo auditctl -l;echo; sudo auditctl -s
No rules

enabled 0
failure 1
pid 19405
rate_limit 0
backlog_limit 0
lost 0
backlog 0
loginuid_immutable 0 unlocked
7:08 PM
no difference in either with force_reconfigure on:
$ sudo auditctl -l;echo; sudo auditctl -s
-a always,exit -S connect
-a always,exit -S bind
-a always,exit -S execve
-a always,exit -S execveat

enabled 1
failure 0
pid 32732
rate_limit 0
backlog_limit 4096
lost 0
backlog 0
loginuid_immutable 0 unlocked
7:10 PM
this is a box that locked up previously, but i really have no way to be sure if it’s in the same state because it was rebooted to recover; we have seen at least one box where it was rebooted to fix this issue and immediately went back into this state as osqueryd came up during startup
n

npamnani

10/24/2019, 7:46 PM
thanks @timb, will update you after analyzing
t

timb

10/24/2019, 7:52 PM
much appreciated! let me know if there’s anything else i can do to help
n

npamnani

10/24/2019, 8:06 PM
@timb, found reason
t

timb

10/24/2019, 8:07 PM
that was... fast haha
n

npamnani

10/24/2019, 8:09 PM
I had some knowledge around this code
8:10 PM
confirming how this problem can stall sshd
8:10 PM
if you run your osquery with verbose then you should see lots of "audit rules could not be correctly remove" message
8:11 PM
please check
8:11 PM
this check is not correct
8:12 PM
since -
8:12 PM
struct audit_message { struct nlmsghdr nlh; char data[MAX_AUDIT_MESSAGE_LENGTH]; }; struct audit_reply { int type; int len; struct nlmsghdr *nlh; struct audit_message msg;
8:13 PM
this means rule_object.size() will always be greater than MAX_AUDIT_MESSAGE_LENGTH
8:13 PM
so check passes and false is returned.
8:15 PM
without closing the audit_netlink_handle_ , audit_netlink_handle_ is being set to -1
8:15 PM
so fd leak here
8:15 PM
now final question, why it is causing sshd to stall?
8:20 PM
look at acquireHandle , this is in a tight loop. Which is calling clearAuditConfiguration and which continuously requesting of audit service in the form of audit rules. which ultimately call audit_receive of kernel
8:21 PM
via control path -
8:21 PM
2019-10-23T12:03:49.966113+00:00 host kernel: [ 1229.872721] mutex_lock+0x2f/0x3a 2019-10-23T12:03:49.969436+00:00 host kernel: [ 1229.876045] audit_receive+0x2b/0x9b 2019-10-23T12:03:49.973024+00:00 host kernel: [ 1229.879631] netlink_unicast+0x18d/0x255 2019-10-23T12:03:49.976959+00:00 host kernel: [ 1229.883557] netlink_sendmsg+0x2df/0x3cc 2019-10-23T12:03:49.980877+00:00 host kernel: [ 1229.887487] sock_sendmsg+0x3e/0x4a 2019-10-23T12:03:49.984373+00:00 host kernel: [ 1229.890981] SYSC_sendto+0x102/0x18b 2019-10-23T12:03:49.987938+00:00 host kernel: [ 1229.894562] ? __audit_syscall_entry+0xac/0xef 2019-10-23T12:03:49.992404+00:00 host kernel: [ 1229.899012] ? syscall_trace_enter+0x1ce/0x2b8 2019-10-23T12:03:49.996849+00:00 host kernel: [ 1229.903454] SyS_sendto+0xe/0x10 2019-10-23T12:03:50.000079+00:00 host kernel: [ 1229.906687] do_syscall_64+0x79/0x1b9 2019-10-23T12:03:50.008792+00:00 host kernel: [ 1229.910351]
t

timb

10/24/2019, 8:21 PM
hmm... in this case i don’t see that message with --verbose; i can try to get a few others later today to verify, but most of the hosts where this issue presented aren’t ones the teams want me melting
n

npamnani

10/24/2019, 8:22 PM
static void audit_receive(struct sk_buff *skb) { struct nlmsghdr nlh; / * len MUST be signed for nlmsg_next to be able to dec it below 0 * if the nlmsg_len was not aligned */ int len; int err; nlh = nlmsg_hdr(skb); len = skb->len; mutex_lock(&audit_cmd_mutex);
t

timb

10/24/2019, 8:22 PM
(which is always the case, isn’t it...)
n

npamnani

10/24/2019, 8:22 PM
from kernel source
8:23 PM
and sshd not able to acquire the mutex audit_cmd_mutex
8:23 PM
this happen when used with reconfigure
8:24 PM
first time, there are no rules install then no issue
t

timb

10/24/2019, 8:25 PM
because the problem is in deleteAuditRule?
n

npamnani

10/24/2019, 8:25 PM
once rules are insalled and error happen and code goes to reacquire the netlink socket then control gets in this vicious loop
8:26 PM
I will create a fix tomorrow , hopefully it will fix your problem
8:26 PM
three problems
8:27 PM
fd leak, tight loop and a check in deleteAuditRule
t

timb

10/24/2019, 8:28 PM
that would be great! i really appreciate your attention
theopolis

theopolis

10/24/2019, 8:37 PM
epic
i

Ivanlei

10/25/2019, 4:46 PM
truly epic. thank you so much @npamnani
n

npamnani

10/27/2019, 7:30 AM
7:34 AM
sorry for delay, I definitely fix this time