In the following example, t is an increasing sequence going roughly from 0 to 5,000,000 over 1 million rows.

import sqlite3, random, time
db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE IF NOT EXISTS data(id INTEGER PRIMARY KEY, t INTEGER, label TEXT);")
t = 0
for i in range(1000*1000):
    t += random.randint(0, 10)  # t is non-decreasing: each row advances it by 0..10
    db.execute("INSERT INTO data(t, label) VALUES (?, ?)", (t, 'hello'))

Selecting a range (let's say t = 1,000,000 ... 2,000,000) with an index:

db.execute("CREATE INDEX t_index ON data(t);")
start = time.time()
print(list(db.execute(f"SELECT COUNT(id) FROM data WHERE t BETWEEN 1000000 AND 2000000")))
print("index: %.1f ms" % ((time.time()-start)*1000))  # index: 15.0 ms

is 4-5 times faster than doing it without an index:

db.execute("DROP INDEX IF EXISTS t_index;")
start = time.time()
print(list(db.execute(f"SELECT COUNT(id) FROM data WHERE t BETWEEN 1000000 AND 2000000")))
print("no index: %.1f ms" % ((time.time()-start)*1000))  # no index: 73.0 ms

but the database size is at least 30% bigger with an index.
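You can measure this yourself with SQLite's page_count and page_size pragmas, which work for :memory: databases too (the helper name db_size_mb is made up); compare the value before and after CREATE INDEX:

def db_size_mb(db):
    # total database size in bytes = number of pages * page size
    (pages,) = db.execute("PRAGMA page_count").fetchone()
    (page_size,) = db.execute("PRAGMA page_size").fetchone()
    return pages * page_size / 1e6

print("size: %.1f MB" % db_size_mb(db))  # run before and after creating t_index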

Question: in general, I understand how indexes massively speed up queries, but in a case like this, where t is an increasing integer, why is an index even needed to speed up the query?

After all, we only need to find the row for which t=1,000,000 (this is possible in O(log n) since the sequence is increasing), find the row for which t=2,000,000, and then we have the range.
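As an illustration, a manual binary search over id already finds those bounds in O(log n) probes with no extra index. This is only a sketch: it assumes the ids are the contiguous sequence 1..1,000,000 created above and that t never decreases (the function name lower_bound is made up):

def lower_bound(db, target, n=1000*1000):
    # smallest id whose t is >= target, by bisection over the contiguous ids 1..n
    lo, hi = 1, n + 1
    while lo < hi:
        mid = (lo + hi) // 2
        (t_mid,) = db.execute("SELECT t FROM data WHERE id = ?", (mid,)).fetchone()
        if t_mid < target:
            lo = mid + 1
        else:
            hi = mid
    return lo

start = time.time()
n_rows = lower_bound(db, 2000001) - lower_bound(db, 1000000)  # rows with t in [1000000, 2000000]
print(n_rows)
print("binary search: %.1f ms" % ((time.time()-start)*1000))

Each probe is itself a B-tree lookup by id, so this is O(log² n) overall, yet it needs no additional storage.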

TL;DR: when a column is an increasing integer sequence, is there a way to get a fast range query without increasing the database size by ~30% with an index?

For example, by setting a parameter when creating the table that informs Sqlite that the column is increasing / already sorted?

Basj

1 Answer


The short answer is that SQLite doesn't know that your table is sorted by column t, so it has to scan through the whole table to extract the data.

When you add an index on the column, the column t is sorted in the index, so SQLite can seek directly to t = 1,000,000 (skipping roughly the first 200,000 rows) and then stream the rows of interest to you. You are extracting 20% of the rows, and the query returns in 15 ms / 73 ms ≈ 21% of the time. The smaller that fraction, the larger the benefit you derive from the index.
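You can see the difference in SQLite's own query plan; the exact wording varies by SQLite version, but it is roughly as the comments below suggest:

print(db.execute("EXPLAIN QUERY PLAN SELECT COUNT(id) FROM data"
                 " WHERE t BETWEEN 1000000 AND 2000000").fetchall())
# with t_index, typically:  SEARCH data USING COVERING INDEX t_index (t>? AND t<?)
# without it, typically:    SCAN data   -- i.e. a full table scan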

If the column t is unique, then consider using that column as the primary key, as you would get the index for "free". If you can bound the number of rows with the same t, then you could use (t, offset) as the primary key, where offset might be a tinyint (see the sketch below). The point being that size(primary key index) + size(t index) would be larger than size(t+offset index). If t were in ms or ns instead of s, it might be unique in practice, or you could fiddle with it when it was not (and just truncate to second resolution when you need the data).
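A sketch of that idea (the table name data2 and the column layout are hypothetical); WITHOUT ROWID makes the (t, offset) primary key the table's only B-tree, so no separate index is stored:

db.execute("""
    CREATE TABLE IF NOT EXISTS data2(
        t        INTEGER NOT NULL,  -- e.g. a unix timestamp in seconds
        "offset" INTEGER NOT NULL,  -- 0, 1, 2, ... for rows sharing the same t (quoted: OFFSET is an SQL keyword)
        label    TEXT,
        PRIMARY KEY (t, "offset")
    ) WITHOUT ROWID
""")
# range queries on t now use the table's own B-tree, with no extra index:
# SELECT COUNT(*) FROM data2 WHERE t BETWEEN 1000000 AND 2000000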

If you don't need a primary key (as a unique index), leave it out and just keep the non-unique index on t. Without a primary key you can still identify a unique row by its rowid, or by all the columns collectively if they form a unique combination. If you create the table WITHOUT ROWID, you could still use LIMIT to operate on identical rows.

You could also use data warehouse techniques: if you don't need per-record data, store it in a less granular fashion (one record per minute, or per hour, and group_concat the text column).
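A sketch of that, assuming t is a unix timestamp in seconds (the table name data_by_minute is made up):

db.execute("""
    CREATE TABLE IF NOT EXISTS data_by_minute AS
    SELECT t / 60              AS minute,  -- integer division buckets seconds into minutes
           COUNT(*)            AS n,
           group_concat(label) AS labels   -- all labels of that minute, comma-joined
    FROM data
    GROUP BY t / 60
""")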

Finally, there are databases that are optimized for time-series data. They may, for instance, only allow you to remove the oldest data or append new data, but not make any changes. This allows such a system to store the data pre-sorted (MySQL, by the way, calls this feature an index-ordered table). As the data cannot change, such a database may run-length or delta compress the data by column so that it only stores differences between rows.

Allan Wind
  • Thank you @AllanWind for your answer. In my real case, `t` is an (integer) unix timestamp (datetime), so it's not guaranteed unique (multiple events can happen at the same second), so I think I can't use this column as PRIMARY KEY. – Basj Dec 22 '20 at 22:47
  • *When you add an index on the column, the column t is sorted in the index*: since `t` is *already* sorted, an index is a waste of disk space. Isn't there a way to inform Sqlite that `t` is *already* sorted, and that it can use this info to speed up "range" lookups? So we can save the additional index. – Basj Dec 22 '20 at 22:49
  • Say there were a way to tell the database "I promise that if id > id' then t > t'". How would the database check that (hint: use an index to do so quickly)? If you don't check and the promise is broken, your data is now "corrupted". – Allan Wind Dec 22 '20 at 23:11
  • @Basj There *is* a way to inform Sqlite that `t` is already sorted. It's called adding an index. If it were possible to inform it by another mechanism there is always the possibility that you could tell it something that is not 100% accurate, or that was initially accurate but became inaccurate due to subsequent changes to the data in the table. – BoarGules Dec 22 '20 at 23:29