I'm doing a personal project that generates a lot of data, and I thought storing it in a local DB made sense. However, I'm seeing a dramatic slowdown as the DB grows, to the point where the project is infeasible to run.
I made a simple test that reproduces the problem. I build a dictionary with a bunch of local processing (roughly 1 million entries), batch insert that into the SQLite DB, then loop and do it all again. Here's the code:
from collections import defaultdict
import sqlite3
import datetime
import random


def log(s):
    now = datetime.datetime.now()
    print(str(now) + ": " + str(s))


def create_table():
    conn = create_connection()
    with conn:
        cursor = conn.cursor()
        sql = """
        CREATE TABLE IF NOT EXISTS testing (
            test text PRIMARY KEY,
            number integer
        );"""
        cursor.execute(sql)
    conn.close()


def insert_many(local_db):
    # Upsert: insert a new row, or add to the existing count on key collision.
    sql = """INSERT INTO testing(test, number) VALUES(?, ?)
             ON CONFLICT(test) DO UPDATE SET number = number + ?;"""
    inserts = []
    for key, value in local_db.items():
        inserts.append((key, value, value))
    conn = create_connection()
    with conn:
        cursor = conn.cursor()
        cursor.executemany(sql, inserts)
    conn.close()


def main():
    log("Starting to process records")
    for i in range(1, 21):
        # Aggregate ~1 million random keys locally before touching the DB.
        local_db = defaultdict(int)
        for j in range(0, 1000000):
            s = "Testing insertion " + str(random.randrange(100000000))
            local_db[s] += 1
        log("Created local DB for " + str(1000000 * i) + " records")
        insert_many(local_db)
        log("Finished inserting " + str(1000000 * i) + " records")


def create_connection():
    conn = None
    try:
        conn = sqlite3.connect('/home/testing.db')
    except sqlite3.Error as e:
        print(e)
    return conn


if __name__ == '__main__':
    create_table()
    main()
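For what it's worth, the same upsert can be written with SQLite's excluded pseudo-table (available since SQLite 3.24), so each row only needs a 2-tuple instead of repeating the value. A minimal sketch of that variant (insert_many_excluded is just an illustrative name):

def insert_many_excluded(local_db):
    # Same aggregation: "excluded.number" refers to the value the conflicting
    # INSERT tried to supply, so the parameter doesn't have to be passed twice.
    sql = """INSERT INTO testing(test, number) VALUES(?, ?)
             ON CONFLICT(test) DO UPDATE SET number = number + excluded.number;"""
    conn = create_connection()
    with conn:
        # dict.items() already yields (key, value) 2-tuples for executemany.
        conn.executemany(sql, local_db.items())
    conn.close()

Both forms do the same aggregation, so the rest of this question uses the three-parameter version above.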
This runs great for a second, then slows down like crazy. Here's the output I just got:
2019-10-23 15:28:59.211036: Starting to process records
2019-10-23 15:29:01.308668: Created local DB for 1000000 records
2019-10-23 15:29:10.147762: Finished inserting 1000000 records
2019-10-23 15:29:12.258012: Created local DB for 2000000 records
2019-10-23 15:29:28.752352: Finished inserting 2000000 records
2019-10-23 15:29:30.853128: Created local DB for 3000000 records
2019-10-23 15:39:12.826357: Finished inserting 3000000 records
2019-10-23 15:39:14.932100: Created local DB for 4000000 records
2019-10-23 17:21:37.257651: Finished inserting 4000000 records
...
As you can see, the first million inserts take about 9 seconds, the next million take about 16, then it balloons to nearly 10 minutes, then to an hour and 42 minutes (!). Is there something weird I'm doing that causes this slowdown, or is this a limitation of SQLite?