< timb> please provide more info osquery #general

Join Slack

<@UMM1J6Y04>, please provide more info

# general

npamnani

10/24/2019, 4:43 AM

@timb, please provide more info

timb

10/24/2019, 4:51 AM

after enabling force_reconfigure, a few hosts would immediately spewing audit_log_start: callback supressed and audit_lost messages, followed immediately by

Copy code

[  984.078179] Call Trace:
[  984.080639]  __schedule+0x2bc/0x89b
[  984.084134]  schedule+0x36/0x7c
[  984.087280]  schedule_preempt_disabled+0xe/0x10
[  984.091812]  __mutex_lock.isra.5+0x20c/0x634
[  984.096089]  ? release_sock+0x8f/0x94

timb

10/24/2019, 4:52 AM

which leads pretty quick to...

Copy code

[ 1106.935142] INFO: task systemd:1 blocked for more than 120 seconds.

npamnani

10/24/2019, 7:42 AM

1. what other audit flags you are using? 2. Can you provide output of 'auditctl -l'? 3. Can you provide more log message around this line but pertaining to audit? 4. Is system crashing or it becomes non-responsive?

timb

10/24/2019, 5:15 PM

these should be the same flags that were in place, beyond audit_force_reconfigure:

Copy code

[
  {"name":"audit_allow_config","value":"true"},
  {"name":"audit_allow_fim_events","value":"false"},
  {"name":"audit_allow_fork_process_events","value":"false"},
  {"name":"audit_allow_process_events","value":"true"},
  {"name":"audit_allow_selinux_events","value":"false"},
  {"name":"audit_allow_sockets","value":"true"},
  {"name":"audit_allow_unix","value":"false"},
  {"name":"audit_allow_user_events","value":"true"},
  {"name":"audit_debug","value":"false"},
  {"name":"audit_fim_debug","value":"false"},
  {"name":"audit_fim_show_accesses","value":"false"},
  {"name":"audit_force_reconfigure","value":"false"},
  {"name":"audit_force_unconfigure","value":"false"},
  {"name":"audit_persist","value":"true"},
  {"name":"audit_show_partial_fim_events","value":"false"},
  {"name":"audit_show_untracked_res_warnings","value":"false"},
  {"name":"disable_audit","value":"false"}
]

timb

10/24/2019, 5:16 PM

do you want auditctl w/ or w/o force_reconfigure enabled? and becoming non-responsive; systemd and sshd block, PAM blocks so you can’t log in, but it still responds to ping, etc

timb

10/24/2019, 5:16 PM

load pretty much infinitely spikes

timb

10/24/2019, 5:17 PM

i didn’t see any log messages specific to audit other than audit_lost type messages; i can do some looking and scrub them a bit soon if you think it’d be helpful

timb

10/24/2019, 5:18 PM

this is osquery 4.0.2 on newer 4x kernels; audit via osquery has been working fine on some of these for >1yr until the force_reconfigure flag was added

timb

10/24/2019, 5:27 PM

with and without force_reconfigure the only difference appears to be the PID

Copy code

$ sudo auditctl -s
enabled 1
failure 0
pid 26178
rate_limit 0
backlog_limit 4096
lost 0
backlog 0
loginuid_immutable 0 unlocked

timb

10/24/2019, 5:47 PM

audit_crash

timb

10/24/2019, 5:48 PM

after that is just a sea of audit_lost and audit_backlog messages. let me know if there’s anything else that would be useful

npamnani

10/24/2019, 6:09 PM

@timb, 'auditctl -l' with force reconfigure

theopolis

10/24/2019, 6:40 PM

^ also, what does that look like before and after starting osquery.

timb

10/24/2019, 7:03 PM

@npamnani auditctl -l looks the same with and without force reconfigure

timb

10/24/2019, 7:04 PM

ah, -l, sorry! i gave you -s, one moment

timb

10/24/2019, 7:07 PM

osquery running w/o force_reconfigure:

Copy code

$ sudo auditctl -l;echo; sudo auditctl -s
-a always,exit -S connect
-a always,exit -S bind
-a always,exit -S execve
-a always,exit -S execveat

enabled 1
failure 0
pid 19405
rate_limit 0
backlog_limit 4096
lost 0
backlog 0
loginuid_immutable 0 unlocked

osquery stopped:

Copy code

$ sudo auditctl -l;echo; sudo auditctl -s
No rules

enabled 0
failure 1
pid 19405
rate_limit 0
backlog_limit 0
lost 0
backlog 0
loginuid_immutable 0 unlocked

timb

10/24/2019, 7:08 PM

no difference in either with force_reconfigure on:

Copy code

$ sudo auditctl -l;echo; sudo auditctl -s
-a always,exit -S connect
-a always,exit -S bind
-a always,exit -S execve
-a always,exit -S execveat

enabled 1
failure 0
pid 32732
rate_limit 0
backlog_limit 4096
lost 0
backlog 0
loginuid_immutable 0 unlocked

timb

10/24/2019, 7:10 PM

this is a box that locked up previously, but i really have no way to be sure if it’s in the same state because it was rebooted to recover; we have seen at least one box where it was rebooted to fix this issue and immediately went back into this state as osqueryd came up during startup

npamnani

10/24/2019, 7:46 PM

thanks @timb, will update you after analyzing

👍 1

timb

10/24/2019, 7:52 PM

much appreciated! let me know if there’s anything else i can do to help

npamnani

10/24/2019, 8:06 PM

@timb, found reason

timb

10/24/2019, 8:07 PM

that was... fast haha

npamnani

10/24/2019, 8:09 PM

I had some knowledge around this code

npamnani

10/24/2019, 8:10 PM

confirming how this problem can stall sshd

npamnani

10/24/2019, 8:10 PM

if you run your osquery with verbose then you should see lots of "audit rules could not be correctly remove" message

npamnani

10/24/2019, 8:11 PM

please check

npamnani

10/24/2019, 8:11 PM

problem start with check at https://github.com/osquery/osquery/blob/master/osquery/events/linux/auditdnetlink.cpp#L503

npamnani

10/24/2019, 8:11 PM

this check is not correct

npamnani

10/24/2019, 8:12 PM

since -

npamnani

10/24/2019, 8:12 PM

struct audit_message { struct nlmsghdr nlh; char data[MAX_AUDIT_MESSAGE_LENGTH]; }; struct audit_reply { int type; int len; struct nlmsghdr *nlh; struct audit_message msg;

npamnani

10/24/2019, 8:13 PM

this means rule_object.size() will always be greater than MAX_AUDIT_MESSAGE_LENGTH

npamnani

10/24/2019, 8:13 PM

so check passes and false is returned.

npamnani

10/24/2019, 8:14 PM

Again at the call site of clearAuditConfiguration https://github.com/osquery/osquery/blob/master/osquery/events/linux/auditdnetlink.cpp#L651

npamnani

10/24/2019, 8:15 PM

without closing the audit_netlink_handle_ , audit_netlink_handle_ is being set to -1

npamnani

10/24/2019, 8:15 PM

so fd leak here

npamnani

10/24/2019, 8:15 PM

now final question, why it is causing sshd to stall?

npamnani

10/24/2019, 8:20 PM

look at acquireHandle , this is in a tight loop. Which is calling clearAuditConfiguration and which continuously requesting of audit service in the form of audit rules. which ultimately call audit_receive of kernel

npamnani

10/24/2019, 8:21 PM

via control path -

npamnani

10/24/2019, 8:21 PM

2019-10-23T120349.966113+00:00 host kernel: [ 1229.872721] mutex_lock+0x2f/0x3a 2019-10-23T120349.969436+00:00 host kernel: [ 1229.876045] audit_receive+0x2b/0x9b 2019-10-23T120349.973024+00:00 host kernel: [ 1229.879631] netlink_unicast+0x18d/0x255 2019-10-23T120349.976959+00:00 host kernel: [ 1229.883557] netlink_sendmsg+0x2df/0x3cc 2019-10-23T120349.980877+00:00 host kernel: [ 1229.887487] sock_sendmsg+0x3e/0x4a 2019-10-23T120349.984373+00:00 host kernel: [ 1229.890981] SYSC_sendto+0x102/0x18b 2019-10-23T120349.987938+00:00 host kernel: [ 1229.894562] ? __audit_syscall_entry+0xac/0xef 2019-10-23T120349.992404+00:00 host kernel: [ 1229.899012] ? syscall_trace_enter+0x1ce/0x2b8 2019-10-23T120349.996849+00:00 host kernel: [ 1229.903454] SyS_sendto+0xe/0x10 2019-10-23T120350.000079+00:00 host kernel: [ 1229.906687] do_syscall_64+0x79/0x1b9 2019-10-23T120350.008792+00:00 host kernel: [ 1229.910351]

timb

10/24/2019, 8:21 PM

hmm... in this case i don’t see that message with --verbose; i can try to get a few others later today to verify, but most of the hosts where this issue presented aren’t ones the teams want me melting

npamnani

10/24/2019, 8:22 PM

static void audit_receive(struct sk_buff *skb) { struct nlmsghdr *nlh; /* * len MUST be signed for nlmsg_next to be able to dec it below 0 * if the nlmsg_len was not aligned */ int len; int err; nlh = nlmsg_hdr(skb); len = skb->len; mutex_lock(&audit_cmd_mutex);

timb

10/24/2019, 8:22 PM

(which is always the case, isn’t it...)

npamnani

10/24/2019, 8:22 PM

from kernel source

npamnani

10/24/2019, 8:23 PM

and sshd not able to acquire the mutex audit_cmd_mutex

npamnani

10/24/2019, 8:23 PM

this happen when used with reconfigure

npamnani

10/24/2019, 8:24 PM

first time, there are no rules install then no issue

timb

10/24/2019, 8:25 PM

because the problem is in deleteAuditRule?

npamnani

10/24/2019, 8:25 PM

once rules are insalled and error happen and code goes to reacquire the netlink socket then control gets in this vicious loop

npamnani

10/24/2019, 8:26 PM

I will create a fix tomorrow , hopefully it will fix your problem

👍 1

npamnani

10/24/2019, 8:26 PM

three problems

npamnani

10/24/2019, 8:27 PM

fd leak, tight loop and a check in deleteAuditRule

timb

10/24/2019, 8:28 PM

that would be great! i really appreciate your attention

theopolis

10/24/2019, 8:37 PM

epic

👍 3

Ivanlei

10/25/2019, 4:46 PM

truly epic. thank you so much @npamnani

npamnani

10/27/2019, 7:30 AM

Untitled

npamnani

10/27/2019, 7:34 AM

sorry for delay, I definitely fix this time

8 Views

Open in Slack

Previous Next