# general
a
Hello! Just to confirm: if I set these options in osquery.flags
--watchdog_memory_limit=300
--watchdog_utilization_limit=130
they will be higher than the default watchdog limits for CPU and RAM, right? I am asking because we tried to increase these values but ended up with lots of denylisted queries.
t
The assumption is correct. Heads up that you cannot set these values from the configuration; double-check the osquery_flags table to verify they are set correctly.
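As a concrete sketch of that check: a query like the one below (run in osqueryi, or as a scheduled/distributed query) shows the effective watchdog settings. The `name`, `value`, and `default_value` columns are real columns of the `osquery_flags` table.

```sql
-- Show the effective watchdog flags alongside their defaults
SELECT name, value, default_value
FROM osquery_flags
WHERE name LIKE 'watchdog%';
```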
a
@theopolis thank you for the answer! Then I don't understand the problem at all. We got the opposite situation; this is what we received as failed queries on the day we centrally updated osquery.flags on Windows and Mac with these values.
Hi @theopolis! We started osquery.flags updating process again 2 hours ago with such values:
--watchdog_memory_limit=400
--watchdog_utilization_limit=250
But we still get a lot of logs with the message "Scheduled query may have failed…". I don't understand the reason at all. Our osquery.flags content (example for Windows):
--enroll_secret_path=C:\Program Files\osquery\enroll_secret
--tls_server_certs=C:\Program Files\osquery\server.pem
--tls_hostname=<censored>
--pidfile=C:\Program Files\osquery\osqueryd.pidfile
--host_identifier=hostname
--enroll_tls_endpoint=/api/v1/osquery/enroll
--config_plugin=tls
--config_tls_endpoint=/api/v1/osquery/config
--config_refresh=60
--disable_distributed=false
--disable_events=true
--disable_extensions=true
--disable_tables=curl
--distributed_plugin=tls
--distributed_interval=60
--distributed_tls_max_attempts=3
--distributed_tls_read_endpoint=/api/v1/osquery/distributed/read
--distributed_tls_write_endpoint=/api/v1/osquery/distributed/write
--logger_plugin=tls
--logger_tls_endpoint=/api/v1/osquery/log
--logger_tls_period=60
--watchdog_memory_limit=400
--watchdog_utilization_limit=250
Is it possible to find the reason in the status logs? If you tell me what information would help you understand the cause, please write; I think I can provide it.
I’ve run a query like
SELECT * FROM osquery_schedule where denylisted='1' and name=<our query name>
and its results correspond to the logs, that is, queries are being massively blocked again. I looked through the status logs via ELK for a few selected clients, but found nothing to suggest the cause of the problem. Is it possible to do some kind of debugging on a separate host? Is there any related information about the watchdog written to the logs in general? So far we see this situation on Windows, and we have not yet rolled out the update to MacBooks in large quantities. For context, we use stock osquery and Fleet.
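For inspecting denylisted entries with a bit more context, a query like this sketch also surfaces the recorded resource usage; `executions`, `wall_time`, `user_time`, `system_time`, and `average_memory` are actual columns of the `osquery_schedule` table.

```sql
-- List denylisted scheduled queries with their recorded resource usage
SELECT name, executions, wall_time, user_time, system_time, average_memory
FROM osquery_schedule
WHERE denylisted = 1;
```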
@zwass hello! Sorry, don't wanna bother you, but maybe you, as the author of https://dactiv.llc/files/osquery-performance-at-scale.pdf, could give us some hints on how to find the cause of the problem? If you need any information or actions from our side, I am ready to help.
z
What is the query that is triggering the watchdog?
a
@zwass I’ll send it to your DM with small changes
z
Can you pick one of the machines it fails on and run osquery via a shell so that you can see the stderr logs? Then you would see
Maximum sustainable CPU...
or
Memory limits exceeded...
logs in stderr.
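A sketch of running osqueryd in the foreground on Windows to see those stderr messages; the paths here are assumptions based on the default install location and the flags file shown earlier in the thread:

```
REM Run osqueryd in the foreground so watchdog messages appear on stderr
"C:\Program Files\osquery\osqueryd\osqueryd.exe" --flagfile "C:\Program Files\osquery\osquery.flags" --verbose
```

Stop the osqueryd Windows service first so the foreground instance does not conflict with it.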
a
Hi! A small update: after enabling --verbose in osqueryd for most of our users, I can confirm that the query is killed immediately after the osqueryd service restarts. It seems that resources are irrelevant here. I also saw no errors about resources when running osqueryd from cmd. Most users have similar logs:
Same log from another user, but here I see additional encoding problems in some of the logs.
Hi! Could you please tell me, is it possible to guess from this information what the problem may be? To be honest, I'm running out of ideas and it's sad. While collecting osqueryd logs from cmd.exe with one of the users, we did not see the output Zach wrote about; in fact, it was the same as shown in the screenshots. Maybe there are other ways to get additional information? I deployed a test VM with Windows 10 and osqueryd connected to a test Fleet, but I could not reproduce the error there. On Monday I will ask my colleagues to reinstall osquery for all users through SCCM, but if that does not help, then I do not even know what to try next.
t
It sounds like the error is one of four things, so it may help to isolate them one by one. It could be that the watchdog settings are not being applied correctly; to isolate this, you can disable the watchdog completely with --disable_watchdog. It could be that the query is actually very resource intensive; to isolate this, try running the query in osqueryi on one of these machines. It could be some error in osquery that we cannot debug with this info; unsure how to conclude that, but it's a possibility. Or it could be some other configuration or state issue, in which case reinstalling may or may not help.
I’m unsure what version you are using, but if you do reinstall, make sure to update to the latest.
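A minimal osquery.flags fragment for the first isolation step might look like this (an illustrative sketch; the rest of the flags would stay as in the example earlier in the thread):

```
# Rule out the watchdog entirely: with this set, no query
# should ever be denylisted for performance violations
--disable_watchdog=true
```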
a
Hi @theopolis, thank you for the answer! We are trying to narrow down the list of possible causes. For now, I can say that osqueryi does not block the query in any way. And we use the current version of osquery, 4.6.0. I will be back when there is new data.
t
osqueryi would not block the query; it runs any query you give it regardless of performance impact, but you can observe the CPU and memory impact while running the shell.
a
@theopolis hi! I'm back with an update on your options. None of them has solved the problem yet. We tried reinstalling osquery on some hosts and disabling the watchdog. The query still gets denylisted when the osqueryd service is restarted. This is typical both for restarting the service on Windows and for starting osqueryd from cmd. I repeat that we did not see any error related to resource limits in cmd.exe. As far as I can see, this situation also applies to Macs, but the number of queries landing on the denylist is several times lower than on Windows. This theory is supported by the fact that the largest number of queries dropped onto the denylist occurs precisely on the days when we launched the deployment via SCCM and Jamf. The number of errors is gradually decreasing. A graph of such logs from Elastic is attached.
t
Thanks for all of the information, let me focus attention on one thing you mentioned "and disabling the watchdog". If the watchdog is disabled then it is impossible for queries to be denylisted due to performance violations. If queries are still being denylisted then either the watchdog is still running or the query is crashing osquery somehow.
So we need to verify the watchdog is truly disabled. If it is not, then the way you are configuring the osquery.flags file is not working. If you have a scheduled query against osquery_info then you can verify the watchdog is disabled by checking the watcher column; it should be -1.
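That check could look like the sketch below; `pid` and `watcher` are real columns of the `osquery_info` table.

```sql
-- watcher should be -1 when the watchdog is disabled;
-- otherwise it holds the PID of the watcher process
SELECT pid, watcher FROM osquery_info;
```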
a
Yes, you are right. Sorry for the hasty conclusions. I ran the query
SELECT * FROM osquery_flags where name='disable_watchdog'
and I see that disabling the watchdog was not applied correctly through SCCM. Here is the osquery.flags config from one of our users. We tried restarting osqueryd via SCCM and manually, but after all this I see that disabling the watchdog is still not applied. We are looking into this now.