# fleet
m
i think i have an easy one today... finally got my first denylisted query on a host... so denylisted = 1... after 24 hours will that turn back to 0 on its own or is there a way to force that "bit" to reset back to 0?
k
Hi @mason kemmerer! I haven't found a method of manually removing a query from the denylist; it will roll back to 0 on its own once it expires.
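In the meantime, a rough way to see which scheduled queries are currently on the denylist on a given host is something like this (just a sketch against osquery's scheduler table):
-- Sketch: list scheduled queries currently marked as denylisted on this host
SELECT name, denylisted, last_executed
FROM osquery_schedule
WHERE denylisted = 1;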
m
2 questions... are there any Fleet-documented ways to measure, or potentially warn admins, that a query is non-performant BEFORE the watchdog service denylists it? I love the Performance Impact column in the Fleet web GUI, but I wish it had a better mathematical explanation of what defines the thresholds: minimal, considerable, excessive. I also wish there was a way to evaluate this with live queries before they go live (scheduled) to all hosts. Since starting this fleet-osquery endeavor I have had this query running every 5 minutes:
SELECT name, query, interval, executions, last_executed, denylisted, output_size,
  IFNULL(system_time / executions, 0) AS avg_sys_time,
  IFNULL(user_time / executions, 0) AS avg_usr_time,
  IFNULL(wall_time / executions, 0) AS avg_wall_time,
  ROUND(average_memory * 1e-6, 2) AS avg_mem_mb  -- convert average_memory (bytes) to MB
FROM osquery_schedule;
When reviewing the results of the above query in Splunk, the logs seemed to show that the denylisted query's avg sys, usr, and wall times and memory usage were all zero leading up to the query being denylisted on the host, which was a huge bummer. FWIW, I have the watchdog service set to the default settings (200 MB or 10% CPU for 12 secs), which also makes me wonder if there's something I misunderstood... is that 10% of the total CPU on the system, of what's available at query execution time, or even of a single core? Just trying to make sense of where my miss was so I can adjust my osquery and Splunk dashboards accordingly.
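(Side note: I can at least confirm what watchdog limits the host is actually running with by asking osquery itself — a quick sketch, assuming the values are exposed through osquery_flags:)
-- Sketch: show the effective watchdog settings on this host
SELECT name, value
FROM osquery_flags
WHERE name LIKE 'watchdog%';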
k
The challenge there is that a query being deny-listed doesn't necessarily mean that it was non-performant.
It just means that something triggered the watchdog when it was running - it could have been this query, but it could have been something else entirely.
We've got a ticket together to document the logic behind the performance metrics... let me grab the TL;DR
> If the median host runtime for the query is up to 2 seconds, we consider the impact "Minimal". Up to 4 seconds would be "Considerable". More than that is "Excessive". If it hasn't actually run successfully, it's "Undetermined".
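If you want a rough per-host read on those buckets before a query goes out broadly, something like this against osquery_schedule gets you in the ballpark. It's only a sketch: Fleet derives the buckets from the median runtime across hosts, whereas this looks at one host's average, and it assumes wall_time is reported in seconds.
-- Sketch: approximate the impact buckets per host from osquery_schedule
SELECT name,
  IFNULL(wall_time / executions, 0) AS avg_wall_time,
  CASE
    WHEN executions = 0 THEN 'Undetermined'
    WHEN wall_time / executions <= 2 THEN 'Minimal'
    WHEN wall_time / executions <= 4 THEN 'Considerable'
    ELSE 'Excessive'
  END AS approx_impact
FROM osquery_schedule;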
We've also recently started including live query runs in the performance metrics. So if you save a query and run it a few times, you'll start to see stats.
m
All of that is super helpful (i didn't know that things outside of osquery could trigger the watchdog), and thanks for explaining the impact definitions. Ah, allowing live queries to determine performance before scheduling to all hosts!?! 🤌
is that something coming in a later update of fleet or available in v4.43.0+?
k
It isn't that things outside of osquery might trigger the watchdog, it's just that this particular query might not be the thing that did.
m
ah so would you say it's more the cumulative effect of all the queries running on the host, and the watchdog just happened to pick this query to denylist?
sorry for all the followup questions, just trying to validate my understanding
k
Totally valid questions.
The watchdog is just looking at the overall CPU and memory use of osquery, so it triggers based on the combined effect of everything osquery is doing at the time. When it triggers, all queries that were running at that moment are added to the denylist, just in case one of them was the cause.
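If you want visibility into the overall footprint the watchdog is reacting to, one option is a small scheduled query against osquery's own process (a rough sketch — it joins processes to osquery_info to find the process answering the query):
-- Sketch: track osquery's own memory and CPU time, the numbers the watchdog watches
SELECT p.pid, p.name, p.resident_size, p.user_time, p.system_time
FROM processes p
JOIN osquery_info i ON p.pid = i.pid;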
m
ah i guess in this particular case just that 1 query was denylisted, from what i saw
k
That's the most likely scenario, but it could also be something else that was happening in the background. It's interesting that the user time and system time were zero before the query was denylisted. What's the actual query?
m
it was....
select * from process_memory_map;
Daily frequency
historically the performance impact reported by fleet across all hosts was minimal... i think that host was particularly busy, and i was disappointed I couldn't detect this query was problematic prior to it being denylisted, since it didn't seem this query was particularly non-performant.
if anything, i think what i'm learning is that it's not always the query itself that causes a denylist... but perhaps the condition of the host it's running on at query runtime?