<@UMM1J6Y04>, please provide more info
# general
n
@timb, please provide more info
t
after enabling force_reconfigure, a few hosts would immediately spewing audit_log_start: callback supressed and audit_lost messages, followed immediately by
Copy code
[  984.078179] Call Trace:
[  984.080639]  __schedule+0x2bc/0x89b
[  984.084134]  schedule+0x36/0x7c
[  984.087280]  schedule_preempt_disabled+0xe/0x10
[  984.091812]  __mutex_lock.isra.5+0x20c/0x634
[  984.096089]  ? release_sock+0x8f/0x94
which leads pretty quick to...
Copy code
[ 1106.935142] INFO: task systemd:1 blocked for more than 120 seconds.
n
1. what other audit flags you are using? 2. Can you provide output of 'auditctl -l'? 3. Can you provide more log message around this line but pertaining to audit? 4. Is system crashing or it becomes non-responsive?
t
these should be the same flags that were in place, beyond audit_force_reconfigure:
Copy code
[
  {"name":"audit_allow_config","value":"true"},
  {"name":"audit_allow_fim_events","value":"false"},
  {"name":"audit_allow_fork_process_events","value":"false"},
  {"name":"audit_allow_process_events","value":"true"},
  {"name":"audit_allow_selinux_events","value":"false"},
  {"name":"audit_allow_sockets","value":"true"},
  {"name":"audit_allow_unix","value":"false"},
  {"name":"audit_allow_user_events","value":"true"},
  {"name":"audit_debug","value":"false"},
  {"name":"audit_fim_debug","value":"false"},
  {"name":"audit_fim_show_accesses","value":"false"},
  {"name":"audit_force_reconfigure","value":"false"},
  {"name":"audit_force_unconfigure","value":"false"},
  {"name":"audit_persist","value":"true"},
  {"name":"audit_show_partial_fim_events","value":"false"},
  {"name":"audit_show_untracked_res_warnings","value":"false"},
  {"name":"disable_audit","value":"false"}
]
do you want auditctl w/ or w/o force_reconfigure enabled? and becoming non-responsive; systemd and sshd block, PAM blocks so you can’t log in, but it still responds to ping, etc
load pretty much infinitely spikes
i didn’t see any log messages specific to audit other than audit_lost type messages; i can do some looking and scrub them a bit soon if you think it’d be helpful
this is osquery 4.0.2 on newer 4x kernels; audit via osquery has been working fine on some of these for >1yr until the force_reconfigure flag was added
with and without force_reconfigure the only difference appears to be the PID
Copy code
$ sudo auditctl -s
enabled 1
failure 0
pid 26178
rate_limit 0
backlog_limit 4096
lost 0
backlog 0
loginuid_immutable 0 unlocked
after that is just a sea of audit_lost and audit_backlog messages. let me know if there’s anything else that would be useful
n
@timb, 'auditctl -l' with force reconfigure
t
^ also, what does that look like before and after starting osquery.
t
@npamnani auditctl -l looks the same with and without force reconfigure
ah, -l, sorry! i gave you -s, one moment
osquery running w/o force_reconfigure:
Copy code
$ sudo auditctl -l;echo; sudo auditctl -s
-a always,exit -S connect
-a always,exit -S bind
-a always,exit -S execve
-a always,exit -S execveat

enabled 1
failure 0
pid 19405
rate_limit 0
backlog_limit 4096
lost 0
backlog 0
loginuid_immutable 0 unlocked
osquery stopped:
Copy code
$ sudo auditctl -l;echo; sudo auditctl -s
No rules

enabled 0
failure 1
pid 19405
rate_limit 0
backlog_limit 0
lost 0
backlog 0
loginuid_immutable 0 unlocked
no difference in either with force_reconfigure on:
Copy code
$ sudo auditctl -l;echo; sudo auditctl -s
-a always,exit -S connect
-a always,exit -S bind
-a always,exit -S execve
-a always,exit -S execveat

enabled 1
failure 0
pid 32732
rate_limit 0
backlog_limit 4096
lost 0
backlog 0
loginuid_immutable 0 unlocked
this is a box that locked up previously, but i really have no way to be sure if it’s in the same state because it was rebooted to recover; we have seen at least one box where it was rebooted to fix this issue and immediately went back into this state as osqueryd came up during startup
n
thanks @timb, will update you after analyzing
👍 1
t
much appreciated! let me know if there’s anything else i can do to help
n
@timb, found reason
t
that was... fast haha
n
I had some knowledge around this code
confirming how this problem can stall sshd
if you run your osquery with verbose then you should see lots of "audit rules could not be correctly remove" message
please check
this check is not correct
since -
struct audit_message { struct nlmsghdr nlh; char data[MAX_AUDIT_MESSAGE_LENGTH]; }; struct audit_reply { int type; int len; struct nlmsghdr *nlh; struct audit_message msg;
this means rule_object.size() will always be greater than MAX_AUDIT_MESSAGE_LENGTH
so check passes and false is returned.
without closing the audit_netlink_handle_ , audit_netlink_handle_ is being set to -1
so fd leak here
now final question, why it is causing sshd to stall?
look at acquireHandle , this is in a tight loop. Which is calling clearAuditConfiguration and which continuously requesting of audit service in the form of audit rules. which ultimately call audit_receive of kernel
via control path -
2019-10-23T120349.966113+00:00 host kernel: [ 1229.872721] mutex_lock+0x2f/0x3a 2019-10-23T120349.969436+00:00 host kernel: [ 1229.876045] audit_receive+0x2b/0x9b 2019-10-23T120349.973024+00:00 host kernel: [ 1229.879631] netlink_unicast+0x18d/0x255 2019-10-23T120349.976959+00:00 host kernel: [ 1229.883557] netlink_sendmsg+0x2df/0x3cc 2019-10-23T120349.980877+00:00 host kernel: [ 1229.887487] sock_sendmsg+0x3e/0x4a 2019-10-23T120349.984373+00:00 host kernel: [ 1229.890981] SYSC_sendto+0x102/0x18b 2019-10-23T120349.987938+00:00 host kernel: [ 1229.894562] ? __audit_syscall_entry+0xac/0xef 2019-10-23T120349.992404+00:00 host kernel: [ 1229.899012] ? syscall_trace_enter+0x1ce/0x2b8 2019-10-23T120349.996849+00:00 host kernel: [ 1229.903454] SyS_sendto+0xe/0x10 2019-10-23T120350.000079+00:00 host kernel: [ 1229.906687] do_syscall_64+0x79/0x1b9 2019-10-23T120350.008792+00:00 host kernel: [ 1229.910351]
t
hmm... in this case i don’t see that message with --verbose; i can try to get a few others later today to verify, but most of the hosts where this issue presented aren’t ones the teams want me melting
n
static void audit_receive(struct sk_buff *skb) { struct nlmsghdr *nlh; /* * len MUST be signed for nlmsg_next to be able to dec it below 0 * if the nlmsg_len was not aligned */ int len; int err; nlh = nlmsg_hdr(skb); len = skb->len; mutex_lock(&audit_cmd_mutex);
t
(which is always the case, isn’t it...)
n
from kernel source
and sshd not able to acquire the mutex audit_cmd_mutex
this happen when used with reconfigure
first time, there are no rules install then no issue
t
because the problem is in deleteAuditRule?
n
once rules are insalled and error happen and code goes to reacquire the netlink socket then control gets in this vicious loop
I will create a fix tomorrow , hopefully it will fix your problem
👍 1
three problems
fd leak, tight loop and a check in deleteAuditRule
t
that would be great! i really appreciate your attention
t
epic
👍 3
i
truly epic. thank you so much @npamnani
n
sorry for delay, I definitely fix this time