#general

TonyC

12/24/2019, 5:05 AM
Hello, I wrote a couple of extensions for OsQuery that have long run times, as much as 30 seconds. When both are scheduled to run every 300 seconds in a pack, they sometimes run at the same time, or at least overlap. Is there a way to guarantee that ALL scheduled queries run serially, and never at the same time? Thank you

alessandrogario

12/24/2019, 11:06 PM
i don't think it is possible; you could try using different timers but it's not guaranteed it's enough to avoid that situation
11:06 PM
it would be best not to have long query processing times and instead do it in another thread and return whatever is ready at query time
11:07 PM
i.e. one thread fills a list of rows, and the generate() returns them
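
A minimal sketch of that pattern, in plain Python rather than any real osquery extension SDK. The class name `BackgroundCollector`, the `collect` callable, and the row format (a list of dicts) are all assumptions made for illustration; only the idea of a worker thread filling a list that generate() returns comes from the message above.

```python
import threading


class BackgroundCollector:
    """Gathers rows on a worker thread so that generate() never blocks."""

    def __init__(self, collect, interval_seconds):
        # collect is a hypothetical slow function returning a list of row dicts
        self._collect = collect
        self._interval = interval_seconds
        self._rows = []
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        # Only valid after start(); waits for the current collection to finish.
        self._stop.set()
        self._thread.join()

    def _run(self):
        while not self._stop.is_set():
            rows = self._collect()  # slow work happens here, off the query path
            with self._lock:
                self._rows = rows
            self._stop.wait(self._interval)

    def generate(self):
        """What the table's generate() would return: a copy of the latest rows."""
        with self._lock:
            return list(self._rows)
```

With this shape, a query that arrives while a collection is still running gets the previous rows immediately (or no rows before the first collection completes) instead of waiting.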

TonyC

12/26/2019, 7:16 PM
I may have found a fix: instead of using a pack with multiple individual queries, use a pack that calls a single combined query. By combining all the individual queries into a single query, they each run in series. This is obviously not optimal, as it only works if you want all the queries to run at the same interval, say every 5 minutes. If you want an additional query to run every 1 minute, there is potential for that 1 minute query to run at the same time as the 5 minute combined query. IMO there should be logic in OsQuery to not execute a query in a pack until any other running queries have completed, i.e. a queue. I know most of what OsQuery does out of the box is pretty instant, so this problem is pretty much nonexistent there; however, OsQuery is designed to run custom extensions, and users like me need to run longer running queries.

alessandrogario

12/26/2019, 7:18 PM
I think this is a design issue inside the extension; additionally, I'm not sure that SQLite itself actually likes locking inside virtual tables (it's not something it was made for)
7:19 PM
I don't think it makes sense to support extensions that have long query times

TonyC

12/26/2019, 7:19 PM
I don't think there is a way for the extension to check if another query is already in progress

alessandrogario

12/26/2019, 7:19 PM
queries should return data that is ready
7:20 PM
or at least, data that can be easily gathered without locking inside the SQLite callbacks
7:20 PM
I would suggest fixing the extension so that it only returns data that is ready to be handed back to osquery, and moving the worker outside the main thread
7:20 PM
this would be similar to the event-based tables that are already present in osquery
7:21 PM
you can also use writable tables if you wish to control which data is produced from SQL
7:21 PM
i.e. INSERT INTO your_extension_table <job_details>

TonyC

12/26/2019, 7:21 PM
If you have multiple tables with millions of records, and you have to do joins, it will take time to gather the data. Having a long running query is no different IMO. I'm not asking for OsQuery to support longer running queries; I'm stating that OsQuery should not run queries at the same time. That would just happen to resolve my issue 🙂

alessandrogario

12/26/2019, 7:21 PM
and in another table save the data that has been produced
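
One possible shape for that, sketched in plain Python without any real osquery SDK calls: an INSERT into a jobs table only enqueues work, a single worker thread runs the jobs one at a time, and a second table returns whatever has been produced so far. `JobRunner`, `insert_job`, `generate_results`, and the `run_job` callable are hypothetical names introduced for this example.

```python
import queue
import threading


class JobRunner:
    """Jobs go in through one table; finished rows come out through another."""

    def __init__(self, run_job):
        # run_job is a hypothetical function: takes a job dict, returns a row dict
        self._run_job = run_job
        self._jobs = queue.Queue()
        self._results = []
        self._lock = threading.Lock()
        threading.Thread(target=self._worker, daemon=True).start()

    def insert_job(self, job):
        """Would back INSERT on the jobs table: enqueue and return at once."""
        self._jobs.put(job)

    def generate_results(self):
        """Would back SELECT on the results table: only finished rows."""
        with self._lock:
            return list(self._results)

    def wait_idle(self):
        """Block until every queued job has been processed."""
        self._jobs.join()

    def _worker(self):
        # A single worker means jobs can never overlap with each other.
        while True:
            job = self._jobs.get()
            try:
                row = self._run_job(job)
            except Exception as exc:
                # Record the failure as a row so the worker keeps running.
                row = dict(job, error=str(exc))
            with self._lock:
                self._results.append(row)
            self._jobs.task_done()
```

Because there is exactly one worker, this design also happens to serialize the heavy work, no matter how many queries or INSERTs arrive at the same time.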
7:22 PM
that reason is exactly why no heavy work should happen inside generate()
7:22 PM
it's not always 100% clear how many times you are hitting each table when a complex query is run
7:24 PM
you can take a look at event-based queries and replicate that functionality within your extension
7:24 PM
if you want to control the component producing data with INSERTs, then you can also look at the extension examples
7:25 PM
this way correctness will always be enforced

TonyC

12/26/2019, 7:25 PM
The extensions I'm writing are not the standard use case. We use OsQuery to report the health of the client, including network access, i.e. how much CPU/Mem/HDD the client is currently utilizing and, from running a speedtest, how much bandwidth it has available. We compare what clients are doing at each site to determine whether a specific site has a problem compared to the others. Yes, this technically goes beyond the scope of OsQuery, but again, OsQuery is designed to be extended

alessandrogario

12/26/2019, 7:25 PM
(otherwise, even if the schedule is serialized, the user can still lock your extension/osquery if more than one heavy table is referenced in the same SQL query)
7:28 PM
osquery is designed to be extended, but there are rules in place to provide correctness and robustness
7:28 PM
additionally, there's SQLite behind the tables, and this also adds additional constraints
7:29 PM
SQLite is being used in serialized mode (default configuration)
7:29 PM
it is bad to lock everything inside its callbacks

TonyC

12/26/2019, 7:30 PM
It's not a lock if it's scheduled, i.e. before a query is executed from a pack, ensure no other queries are currently active
7:31 PM
OsQuery is managing when a query is made to the SQLite service. I'm just stating that OsQuery should make sure no other execution requests have been made until the others have completed

alessandrogario

12/26/2019, 7:31 PM
i don't think it matters whether the query is scheduled or not; regardless, when the query hits the table everything is inside SQLite
7:32 PM
requiring users to remember not to run too many queries against specific tables is really bad user experience design

TonyC

12/26/2019, 7:33 PM
Again, OsQuery manages when a query hits the table. It manages the schedule of a pack and when each query in that pack runs, e.g. every 300 seconds. It can certainly have logic that says: it's time for the next interval to execute; are all the others complete? If so, run

alessandrogario

12/26/2019, 7:33 PM
delaying queries like that is just a hack that will fall short as soon as multiple ad-hoc queries are run
7:33 PM
or a single SQL query joins multiple tables that perform too much work in their own generate() functions

TonyC

12/26/2019, 7:34 PM
That is true. ad-hoc queries would throw a wrench in that scenario

alessandrogario

12/26/2019, 7:35 PM
yeah, same for SELECT * FROM heavy_table1 JOIN heavy_table2 JOIN heavy_table3
7:36 PM
You could try opening a feature request to see what everyone would think about implementing this
7:36 PM
It could be (maybe?) useful, but I think that as it is it's the wrong answer to a wrong question
7:37 PM
there are some additional hacks that can be done easily(ish) though

TonyC

12/26/2019, 7:37 PM
My use case is pretty unique. I doubt others are facing the same issue, otherwise there would probably already be a feature request. My initial question has been answered that OsQuery will run scheduled queries at the same time. I was hoping the answer was no

alessandrogario

12/26/2019, 7:38 PM
like providing a new table that contains a boolean for each running generate() in your extension
7:38 PM
then decrease the schedule time
7:38 PM
and use that table to control whether the query ends up inside the big table

TonyC

12/26/2019, 7:38 PM
hmm, that's an interesting approach.

alessandrogario

12/26/2019, 7:39 PM
so that metadata table would contain a row for each busy table
7:39 PM
that lives inside the extension
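
A rough sketch of such a metadata table, again in plain Python with made-up names (`BusyRegistry`, `running`, and the `table_name`/`busy` columns are assumptions, not part of osquery). Each heavy generate() wraps its work in `running(...)`, and the metadata table's own generate() lists whichever tables are busy at that moment.

```python
import threading
from contextlib import contextmanager


class BusyRegistry:
    """Tracks which tables of the extension are inside a generate() call."""

    def __init__(self):
        self._counts = {}
        self._lock = threading.Lock()

    @contextmanager
    def running(self, table_name):
        """Mark table_name as busy for the duration of the with-block."""
        with self._lock:
            self._counts[table_name] = self._counts.get(table_name, 0) + 1
        try:
            yield
        finally:
            with self._lock:
                self._counts[table_name] -= 1
                if self._counts[table_name] == 0:
                    del self._counts[table_name]

    def generate(self):
        """Rows for the hypothetical metadata table: one per busy table."""
        with self._lock:
            return [{"table_name": name, "busy": "1"}
                    for name in sorted(self._counts)]
```

A heavy table would then do its work inside `with registry.running("speedtest"):`, and a scheduled query could consult the metadata table before touching the heavy one.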
7:39 PM
can probably also be hacked together with discovery queries
7:39 PM
it's not a unique use case though
7:40 PM
i have written a bpftrace integration that has a generate function that runs forever
7:40 PM
but the table exposed to osquery contains jobs
7:40 PM
that can be added/removed with INSERT and DELETE

TonyC

12/26/2019, 7:40 PM
I can just add a variable in my extension that keeps track. Would be easier if there was a way to query OsQuery for running queries 🙂

alessandrogario

12/26/2019, 7:40 PM
and everything else is handled as events

TonyC

12/26/2019, 7:42 PM
Well again in my case, I am doing speedtests. The reason I want it serialized is I don't want 2+ speedtests running at the same time. That affects the total throughput available. If it was just a case of long running queries, it wouldn't matter as the query would just report when it completed

alessandrogario

12/26/2019, 7:42 PM
this is usually handled with caching
7:43 PM
the cache expires every X minutes, and a hidden column is added to the table
7:43 PM
so that it can be forcefully cleared
7:43 PM
SELECT * FROM speed_test;
would use the cache automatically, if not expired
7:43 PM
SELECT * FROM speed_test WHERE force_cache_invalidation=1;
7:43 PM
would force it from scratch
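
The caching behaviour described here could look roughly like the following. This is a plain Python sketch under assumptions: `CachedTable` and the `measure` callable are hypothetical, and the hidden column is modelled as a simple boolean argument, whereas a real extension would have to read the `force_cache_invalidation` constraint from whatever query context its SDK provides.

```python
import time


class CachedTable:
    """Returns cached rows until they expire or a refresh is forced."""

    def __init__(self, measure, ttl_seconds, clock=time.monotonic):
        # measure is a hypothetical slow function returning a list of row dicts
        self._measure = measure
        self._ttl = ttl_seconds
        self._clock = clock
        self._rows = None
        self._stamp = 0.0

    def generate(self, force_cache_invalidation=False):
        expired = (self._rows is None
                   or self._clock() - self._stamp >= self._ttl)
        if force_cache_invalidation or expired:
            self._rows = self._measure()
            # Stamp after the measurement so the TTL counts from fresh data.
            self._stamp = self._clock()
        return list(self._rows)
```

Here `generate()` corresponds to the plain SELECT and `generate(force_cache_invalidation=True)` to the SELECT with the WHERE clause on the hidden column.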
7:44 PM
we already have tables that work (kind of) like this

TonyC

12/26/2019, 7:44 PM
caching wouldn't help if I have a pack with 5+ individual queries that are scheduled to run every 300 seconds. Each speed test would run at the same time which would cause invalid results

alessandrogario

12/26/2019, 7:44 PM
the HIDDEN column will not show up in SELECT statements unless directly queried
7:46 PM
it could return no rows if there is no measure and it can't be taken

TonyC

12/26/2019, 7:46 PM
To be clear, I am running 5 different tests in my queries. IE select * from speedtest where dc = 1, and select * from speedtest where dc = 2. When executed, it runs an iperf like test to the specified DC. If both of these queries are running at the same time, the throughput for each query is cut in half

alessandrogario

12/26/2019, 7:48 PM
if the table is the same, a more sqlite/osquery friendly approach would be to remove the WHERE
7:48 PM
pass the DC list via configuration
7:48 PM
so that the 5 DCs are tested in a single SELECT * FROM speedtest

TonyC

12/26/2019, 7:48 PM
We want to monitor the bandwidth the client has to each DC. When scheduling one to run every 300 seconds, and the second every 330, there is still potential for overlap

alessandrogario

12/26/2019, 7:48 PM
so one row per DC passed via configuration
7:49 PM
(and still use caching when possible) this would be user friendly, and hard to break
7:49 PM
from SQL
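
As an illustration of the "one row per DC passed via configuration" idea, here is a small Python sketch. The JSON layout with a `dcs` key, the function names, and the `run_test` callable are invented for this example; how an actual extension receives its configuration is not covered in this thread.

```python
import json


def load_dcs(config_text):
    """Read the DC list from a hypothetical JSON config: {"dcs": [1, 2]}."""
    return list(json.loads(config_text)["dcs"])


def generate_speedtest(dcs, run_test):
    """Test each configured DC in turn and return one row per DC.

    run_test is a hypothetical function that blocks until the test against
    one DC has finished and returns the measured throughput.
    """
    rows = []
    for dc in dcs:
        # The loop is sequential, so two tests never share the link.
        mbps = run_test(dc)
        rows.append({"dc": str(dc), "mbps": str(mbps)})
    return rows
```

A single SELECT * FROM speedtest would then produce all the rows, and no WHERE clause from SQL can cause two DC tests to run against each other.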

TonyC

12/26/2019, 7:50 PM
I have a workaround: a single query, select * from speedtest where dc = 1 select * from speedtest where dc = 2 .... With this single saved query, each is executed in series, which resolves the problem of having them in separate saved queries that the pack would call at the same time
7:51 PM
I actually have a temporary table with all the DCs defined, and run select * from speedtest where dc in (dctable)
7:53 PM
This runs everything in series, but causes a problem if I wanted to have different speedtests run at different times. IE dcs 1, 2 at every 300 seconds, and dcs 3,4 at every 600 seconds. There will be overlap and multiple speedtests will run at the same time
7:54 PM
I will just have to keep track in the extension of when tests are being run, and not execute the pack if one is in progress
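
One way to keep track of that inside the extension is a non-blocking lock, sketched below in plain Python. `guarded_speedtest` and `run_test` are hypothetical names. Returning no rows when a test is already in progress follows the earlier suggestion that the table "could return no rows if there is no measure and it can't be taken"; a blocking lock would also serialize the tests, but it would wait inside generate(), which is what the discussion above advises against.

```python
import threading

# One lock shared by every speed test table in the extension (assumption).
_test_lock = threading.Lock()


def guarded_speedtest(dc, run_test):
    """Run a test against dc unless another test is already in progress."""
    if not _test_lock.acquire(blocking=False):
        # Another test holds the link: report nothing rather than
        # wait here or produce a skewed measurement.
        return []
    try:
        return [{"dc": str(dc), "mbps": str(run_test(dc))}]
    finally:
        _test_lock.release()
```

The trade-off is that an overlapping scheduled query yields an empty result for that interval instead of a delayed one.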