In the following example, t is an increasing integer sequence going roughly from 0 to 5,000,000 across 1 million rows.
import sqlite3, random, time
t = 0
db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE IF NOT EXISTS data(id INTEGER PRIMARY KEY, t INTEGER, label TEXT);")
for i in range(1000*1000):
    t += random.randint(0, 10)  # non-decreasing t, average step of 5
    db.execute("INSERT INTO data(t, label) VALUES (?, ?)", (t, 'hello'))
Selecting a range (let's say t = 1,000,000 ... 2,000,000) with an index:
db.execute("CREATE INDEX t_index ON data(t);")
start = time.time()
print(list(db.execute(f"SELECT COUNT(id) FROM data WHERE t BETWEEN 1000000 AND 2000000")))
print("index: %.1f ms" % ((time.time()-start)*1000)) # index: 15.0 ms
is 4-5 times faster than doing it without an index:
db.execute("DROP INDEX IF EXISTS t_index;")
start = time.time()
print(list(db.execute(f"SELECT COUNT(id) FROM data WHERE t BETWEEN 1000000 AND 2000000")))
print("no index: %.1f ms" % ((time.time()-start)*1000)) # no index: 73.0 ms
but the database size is at least 30% bigger with an index.
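The overhead can be measured directly: page_count * page_size (both standard PRAGMAs) gives the current footprint of the in-memory database, and comparing the value before and after CREATE INDEX shows the extra space (for a file-backed database, the file size on disk tells the same story):
def db_size(db):
    # total bytes currently allocated by the database (pages * page size)
    (pages,) = db.execute("PRAGMA page_count").fetchone()
    (page_size,) = db.execute("PRAGMA page_size").fetchone()
    return pages * page_size

print("size: %.1f MB" % (db_size(db) / 1e6))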
Question: in general, I understand how indexes massively speed up queries, but in a case like this, where t is an increasing integer column, why is an index even needed to speed up the query?
After all, we only need to find the first row for which t >= 1,000,000 (possible in O(log n) since the sequence is increasing), find the last row for which t <= 2,000,000, and then we have the whole range.
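To make that concrete, here is a rough sketch of the manual binary search I have in mind (assumptions of this sketch: ids were assigned in insertion order with no deletes, so they are contiguous and t is sorted by id, and each lookup only uses the implicit index on the INTEGER PRIMARY KEY):
def lower_bound(db, value):
    # smallest id whose t is >= value (assumes t is non-decreasing in id order)
    lo, hi = db.execute("SELECT MIN(id), MAX(id) FROM data").fetchone()
    hi += 1  # search the half-open interval [lo, hi)
    while lo < hi:
        mid = (lo + hi) // 2
        (t_mid,) = db.execute("SELECT t FROM data WHERE id = ?", (mid,)).fetchone()
        if t_mid < value:
            lo = mid + 1
        else:
            hi = mid
    return lo

start = time.time()
first = lower_bound(db, 1000000)     # first id with t >= 1,000,000
last = lower_bound(db, 2000001) - 1  # last id with t <= 2,000,000
print(max(0, last - first + 1))      # number of rows in the range
print("binary search: %.1f ms" % ((time.time()-start)*1000))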
TL;DR: when a column is an increasing integer sequence, is there a way to get a fast range query without making the database at least 30% bigger with an index?
For example, by setting a parameter when creating the table that informs SQLite that the column is increasing / already sorted?
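(The closest workaround I can think of, sketched below, is to make t the leading column of the primary key of a WITHOUT ROWID table, so the table b-tree itself is ordered by t and no separate index is needed; but then id is no longer auto-assigned, so I'm not sure this is what I want.)
db2 = sqlite3.connect(':memory:')
db2.execute("CREATE TABLE data(t INTEGER, id INTEGER, label TEXT, "
            "PRIMARY KEY(t, id)) WITHOUT ROWID;")
# rows are stored ordered by (t, id), so WHERE t BETWEEN ... is a b-tree search
# without any extra index, but id has to be filled in by hand on INSERT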