<@U7QP20JQH> for the JOIN, the sqlite logic to exe...
# core
s
@seph for the JOIN, the sqlite logic to execute it is implemented as two nested for loops, and each table lands on the external or internal for loop based on the order in which the tables appear in the JOIN. The external table is already called only once, but then each row is used to call the internal table. I don't think this can be changed (for instance, having the external table collect all the constraint values, and then sending them all together).
I also tested osquery 4.0.2 and indeed even without using a column that is index in the JOIN, the internal table is called multiple times.
The two for nested loop thing is documented here: https://www.sqlite.org/queryplanner-ng.html in the
Query Planning In SQLite
again I feel like the reason is that the internal logic of sqlite has been written for a sqlite DB file, and virtual tables come second, so the problem of not having to for instance re-read from the DB file the rows, each time, is handled by sqlite keeping things in memory, but this mechanism is expected to be handled by the xFilter implementation of the virtual table instead.
So that it permits you to do caching if you want or not if you don't. In any case here is where we add the constraints to the context: https://github.com/osquery/osquery/blob/8e8d596564e981ddac689e2d772f56fdf6ba25ec/osquery/sql/virtual_table.cpp#L915-L934 And indeed it seems that it might be possible for multiple values to be inserted for the same constraint (namely
constraint.first
remains the same between multiple
constraint
and
constraint.second
instead changes)
the problem is clearly the place that generates them. That piece of code is also the one that has other issues with NULL or empty constraint values, I forget which.
Ah and here it is, was in a comment
So https://github.com/osquery/osquery/pull/5924 was actually the PR that changed something about indexes
And testing on 4.0.2 btw I do see the old behavior of
IN
where it would call the table multiple times for each value in the constraint; but now, only does a single query.
Now (ish, 5.4.0):
Copy code
select pid from processes WHERE name IN ("launchd", "zsh", "whatever");
osquery planner: xBestIndex Evaluating constraints for table: processes [index=0 column=1 term=0 usable=1]
osquery planner: xBestIndex Recording constraint set for table: processes [cost=1000000.000000 size=0 idx=1]
osquery planner: xOpen Opening cursor (1) for table: processes
osquery planner: xFilter Filtering called for table: processes [constraint_count=1 argc=0 idx=1]
osquery planner: Scanning rows for cursor (1)
osquery planner: xFilter processes generate returned row count:710
osquery planner: Closing cursor (1)
+-------+
| pid   |
+-------+
| 1     |
| 40429 |
| 41257 |
| 47874 |
| 75073 |
| 75516 |
+-------+
osquery 4.0.2:
Copy code
select pid from processes WHERE name IN ("launchd", "zsh", "whatever");
osquery planner: Opening cursor (0) for table: processes
osquery planner: Adding constraint to cursor (0): name = launchd
osquery planner: Scanning rows for cursor (0)
osquery planner: Adding constraint to cursor (0): name = whatever
osquery planner: Scanning rows for cursor (0)
osquery planner: Adding constraint to cursor (0): name = zsh
osquery planner: Scanning rows for cursor (0)
osquery planner: Closing cursor (0)
+-------+
| pid   |
+-------+
| 1     |
| 40429 |
| 41257 |
| 47874 |
| 75073 |
| 75516 |
+-------+
see the difference in
Scanning rows for cursor
amounts
Also, at least with
IN
, if you do pass an index column (like pid for processes), it then starts calling xFilter multiple times again. This is somewhat controlled by this part of code: https://github.com/osquery/osquery/blob/8e8d596564e981ddac689e2d772f56fdf6ba25ec/osquery/sql/virtual_table.cpp#L764-L774
especially that
continue
; this is based on the fact that sqlite will always filter again the rows/results you give it. It's what makes things work when our tables do not have any index/primary key that we can query, and they always return all the rows.
In any case, if there's no way to coerce sqlite to group the constraint values, I think there's a possibility in xFilter of collecting all the constraint values on the same columns each time it's called; so basically in that
IN
case on an index column, you pass 3 constraint values, xFilter will be called 3 times and each time you collect the constraint and don't call the table, only at the last call you pass the entire result. This though it's tricky obviously and would only work if the constraint set that xFilter is called with contains 1 constraint, unless we know for sure that sqlite will call xFilter multiple times with the same set, and when the set changes, then we will never get the same set again in the same query. Meaning, if the query doesn't only have a IN, but also an AND, and an OR with another list of constraints.. we want to group all the values from the IN + AND, and then group the constraint after the OR on their own.
and call the generate when the set changes (although I'm not sure if you can properly detect that before it happens). Anyway /ramblings
Copy code
Then, under the usual mode of handling IN operators, SQLite generates bytecode that invokes the xFilter() method once for each value on the right-hand side of the IN operator. Thus the virtual table only sees a single value from the right-hand side of the IN operator at a time.

In some cases, however, it would be advantageous for the virtual table to see all values on the right-hand of the IN operator all at once
And a final (for real) thing I forgot to say earlier; where today the
IN
constraint causes a single call to xFilter is not because https://osquery.slack.com/archives/C08VA3XQU/p1696964943771509?thread_ts=1696961882.399859&amp;cid=C08VA3XQU causes the xFilter to be called once with all the values of the constraint; no, we cause the constraint to become unusable (because the cost is set to the maximum), and since there aren't any other constraints, then sqlite is forced to call xFilter once, with no constraint, which means the table will return all the rows. But then when receiving those rows from the xFilter, sqlite will filter them as usual.
Another relevant discussion: https://sqlite.org/forum/info/3619760999ccfc4d5ba5a178e1b1cc1bf444c948ecf05e5bd80afc5e6aca91e2. Although the OP wants to have xFilter called always once with no constraints on a JOIN, there's discussion about the fact that one has to expect xFilter to be called multiple times, once for each row in the external loop. And there's again mention about
sqlite3_vtab_in
or changing how the query is formulated. It should also be possible to do some internal caching and/or as I was saying yesterday collect all the constraints ourselves, but it's not something directly supported by sqlite; it definitely seems that sqlite expects virtual tables to be fast on their own with retrieving data.
s
Thank you for digging. I’ve noted this thread.
Your observation that even if we fix
IN
, there is still complex boolean logic around
AND
and
OR
makes me think that we should be playing with some internal cache mechanism instead of trying to adjust the xFilter behavior