SQLite R*Tree index not used with DISTINCT

Question

In SQLite 3.20.1, I have an R*Tree index (dog_bounds) and a temporary table (frisbees) created as follows:

-- Changes infrequently and has ~100k entries
CREATE VIRTUAL TABLE dog_bounds USING rtree (
    dog_id,
    min_x, max_x,
    min_y, max_y
);

-- Changes frequently and has ~100 entries
CREATE TEMPORARY TABLE frisbees (
    frisbee_id,
    min_x, max_x,
    min_y, max_y
);

Queries are fast using this index, like so:

EXPLAIN QUERY PLAN
SELECT dog_id FROM dog_bounds AS db, frisbees AS f
    WHERE db.max_x >= f.min_x AND db.max_y >= f.min_y
    AND db.min_x < f.max_x AND db.min_y < f.max_y;

0|0|1|SCAN TABLE frisbees AS f
0|1|0|SCAN TABLE dog_bounds AS db VIRTUAL TABLE INDEX 2:D1D3C0C2

However, if I select DISTINCT(dog_id), the index is no longer used, and queries become slow, even after ANALYZE:

EXPLAIN QUERY PLAN
SELECT DISTINCT(dog_id) FROM dog_bounds AS db, frisbees AS f
    WHERE db.max_x >= f.min_x AND db.max_y >= f.min_y
    AND db.min_x < f.max_x AND db.min_y < f.max_y;

0|0|0|SCAN TABLE dog_bounds AS db VIRTUAL TABLE INDEX 2:
0|1|1|SCAN TABLE frisbees AS f
0|0|0|USE TEMP B-TREE FOR DISTINCT

How can I get the R*Tree index used here as well? It would be a shame to duplicate dogs!

Dog_id is always unique isn’t it? No need for distinct function call.. — boateng, May 30 '18 at 18:36
@numbtongue `dog_id` may be unique in `dog_bounds`, but not in the frisbee-overlap query results; the non-distinct query returns 526 rows, while the distinct one returns 135. — sjones, May 30 '18 at 18:44
I see, is it possible to create a composite primary key for the frisbee-overlap and select that “without” distinct.. I feel calling this completely disregards the index.. — boateng, May 30 '18 at 18:51
@numbtongue the frisbees change quite frequently and their overlap cannot be know ahead of time (I updated the schema snipped with this info) — sjones, May 30 '18 at 18:58
I would try with inner join http://www.sqlitetutorial.net/sqlite-inner-join/ — boateng, May 30 '18 at 19:23

score 3 · Accepted Answer · answered May 30 '18 at 19:14

The query optimizer thinks that a different execution order makes it easier to get distinct dog_id values.

Move the R-tree lookup into a subquery so that the query optimizer is forced to do both things separately:

SELECT DISTINCT dog_id
FROM (SELECT dog_id
      FROM dog_bounds AS db, frisbees AS f
      WHERE db.max_x >= f.min_x AND db.max_y >= f.min_y
        AND db.min_x <  f.max_x AND db.min_y <  f.max_y);

QUERY PLAN
|--SCAN TABLE dog_bounds AS db VIRTUAL TABLE INDEX 2:
|--SCAN TABLE frisbees AS f
`--USE TEMP B-TREE FOR DISTINCT

Oops, the query optimizer was too clever and flattened the subquery. But there are ways to disable it (rule 21):

SELECT DISTINCT dog_id
FROM (SELECT dog_id
      FROM dog_bounds AS db, frisbees AS f
      WHERE db.max_x >= f.min_x AND db.max_y >= f.min_y
        AND db.min_x <  f.max_x AND db.min_y <  f.max_y
      LIMIT -1);

QUERY PLAN
|--CO-ROUTINE 0x892A90
|  |--SCAN TABLE frisbees AS f
|  `--SCAN TABLE dog_bounds AS db VIRTUAL TABLE INDEX 2:D1D3C0C2
|--SCAN SUBQUERY 0x892A90
`--USE TEMP B-TREE FOR DISTINCT

SQLite R*Tree index not used with DISTINCT

1 Answers1