SQLite3 query optimization join vs subselect

Question

I am trying to figure out the very best way, (probably doesn't matter in this case) to find the rows of one table, based on the existence of a flag, and an relational id in a row in another table.

here are the schemas:

    CREATE TABLE files (
id INTEGER PRIMARY KEY,
dirty INTEGER NOT NULL);

    CREATE TABLE resume_points (
id INTEGER PRIMARY KEY  AUTOINCREMENT  NOT NULL ,
scan_file_id INTEGER NOT NULL );

I am using SQLite3

there files table will be very large, 10K-5M rows typically. the resume_points will be small <10K with only 1-2 distinct scan_file_id's

so my first thought was:

select distinct files.* from resume_points inner join files
on resume_points.scan_file_id=files.id where files.dirty = 1;

a coworker suggested turning the join around:

select distinct files.* from files inner join resume_points
on files.id=resume_points.scan_file_id where files.dirty = 1;

then I thought since we know that the number of distinct scan_file_id's will be so small, perhaps a subselect would be optimal (in this rare instance):

select * from files where id in (select distinct scan_file_id from resume_points);

the explain outputs had the following rows: 42, 42, and 48 respectively.

This depends on your data and your hardware. You have to measure this yourself. — CL., Jun 29 '13 at 08:10

John Tseng · Accepted Answer · 2013-07-05T13:26:10.100

TL;DR: The best query and index is:

create index uniqueFiles on resume_points (scan_file_id);
select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;

Since I typically work with SQL Server, at first I thought that surely the query optimizer would find the optimal execution plan for such a simple query regardless of which way you write these equivalent SQL statements. So I downloaded SQLite, and started playing around. Much to my surprise, there was a huge difference in performance.

Here's the setup code:

CREATE TABLE files (
id INTEGER PRIMARY KEY autoincrement,
dirty INTEGER NOT NULL);

CREATE TABLE resume_points (
id INTEGER PRIMARY KEY  AUTOINCREMENT  NOT NULL ,
scan_file_id INTEGER NOT NULL );

insert into files (dirty) values (0);
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;

insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;

insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;

I considered two indices:

create index dirtyFiles on files (dirty, id);
create index uniqueFiles on resume_points (scan_file_id);
create index fileLookup on files (id);

Below are the queries I tried and the execution times on my i5 laptop. The database file size is only about 200MB since it doesn't have any other data.

select distinct files.* from resume_points inner join files on resume_points.scan_file_id=files.id where files.dirty = 1;
4.3 - 4.5ms with and without index

select distinct files.* from files inner join resume_points on files.id=resume_points.scan_file_id where files.dirty = 1;
4.4 - 4.7ms with and without index

select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;
2.0 - 2.5ms with uniqueFiles
2.6-2.9ms without uniqueFiles

select * from files where id in (select distinct scan_file_id from resume_points) and dirty = 1;
2.1 - 2.5ms with uniqueFiles
2.6-3ms without uniqueFiles

SELECT f.* FROM resume_points rp INNER JOIN files f on rp.scan_file_id = f.id
WHERE f.dirty = 1 GROUP BY f.id
4500 - 6190 ms with uniqueFiles
8.8-9.5 ms without uniqueFiles
    14000 ms with uniqueFiles and fileLookup

select * from files where exists (
select * from resume_points where files.id = resume_points.scan_file_id) and dirty = 1;
8400 ms with uniqueFiles
7400 ms without uniqueFiles

It looks like SQLite's query optimizer isn't very advanced at all. The best queries first reduce resume_points to a small number of rows (Two in the test case. The OP said it would be 1-2.), and then look up the file to see if it is dirty or not. dirtyFiles index didn't make much of a difference for any of the files. I think it may be because of the way the data is arranged in the test tables. It may make a difference in production tables. However, the difference is not too great as there will be less than a handful of lookups. uniqueFiles does make a difference since it can reduce 10000 rows of resume_points to 2 rows without scanning through most of them. fileLookup did make some queries slightly faster, but not enough to significantly change the results. Notably it made group by very slow. In conclusion, reduce the result set early to make the biggest differences.

score 1 · Answer 2 · answered Jun 28 '13 at 22:20

Since files.id is the primary key, try GROUPing BY this field rather than checking DISTINCT files.*

SELECT f.*
FROM resume_points rp
INNER JOIN files f on rp.scan_file_id = f.id
WHERE f.dirty = 1
GROUP BY f.id

Another option to consider for performance is adding an index to resume_points.scan_file_id.

CREATE INDEX index_resume_points_scan_file_id ON resume_points (scan_file_id)

score 1 · Answer 3 · answered Jul 02 '13 at 00:20

You could try exists, which will not produce any duplicate files:

select * from files
where exists (
    select * from resume_points 
    where files.id = resume_points.scan_file_id
)
and dirty = 1;

Of course it might help to have the proper indexes:

files.dirty
resume_points.scan_file_id

Whether an index is helpful will depend on your data.

Koryu · Answer 4 · 2013-07-03T12:12:44.800

I think jtseng gave the solution.

select * from (select distinct scan_file_id from resume_points) d
join files on d.scan_file_id = files.id and files.dirty = 1

Basically it's the same what you have posted as your last option:

select * from files where id in (select distinct scan_file_id from resume_points) and dirty = 1;

It's beacuse you have to avoid a full table scan/join.

So at first you need your 1-2 distinct ids:

select distinct scan_file_id from resume_points

after that only your 1-2 rows have to be joined on the other table instead of all 10K, which gives the performance optimization.

if you need this statement several times, i would put it into a view. the view wont change the performance but it looks cleaner/easier to read.

also check the query optimization documentation: http://www.sqlite.org/optoverview.html

score 0 · Answer 5 · answered Jul 02 '13 at 04:15

If the table "resume_points" will have only one or two distinct file id numbers, it seems to need only one or two rows, and seems to need scan_file_id as the primary key. That table only has two columns, and the id number is meaningless.

And if that's the case, you don't need either of the ID numbers.

pragma foreign_keys = on;
CREATE TABLE resume_points (
  scan_file_id integer primary key
);

CREATE TABLE files (
  scan_file_id integer not null references resume_points (scan_file_id),
  dirty INTEGER NOT NULL,
  primary key (scan_file_id, dirty)
);

And now you don't need the join, either. Just query the "files" table.

SQLite3 query optimization join vs subselect

5 Answers5