
I'm doing a personal project that generates a lot of data, and I thought that storing it in a local DB made sense. However, I'm seeing insane slowdown as the DB grows, which makes it infeasible to run.

I made a simple test showing what I'm doing. I make a dictionary where I do a bunch of local processing (roughly 1 million entries), then batch insert that into the SQLite DB, then loop and do it all again. Here's the code:

from collections import defaultdict
import sqlite3
import datetime
import random

def log(s):
    now = datetime.datetime.now()
    print(str(now) + ": " + str(s))

def create_table():
    conn = create_connection()
    with conn:
        cursor = conn.cursor()

        sql = """
            CREATE TABLE IF NOT EXISTS testing (
                test text PRIMARY KEY,
                number integer
            );"""
        cursor.execute(sql)
    conn.close()

def insert_many(local_db):
    sql = """INSERT INTO testing(test, number) VALUES(?, ?) ON CONFLICT(test) DO UPDATE SET number=number+?;"""

    inserts = []
    for key, value in local_db.items():
        inserts.append((key, value, value))

    conn = create_connection()
    with conn:
        cursor = conn.cursor()
        cursor.executemany(sql, inserts)
    conn.close()

def main():
    i = 0
    log("Starting to process records")
    for i in range(1, 21):
        local_db = defaultdict(int)
        for j in range(0, 1000000):
            s = "Testing insertion " + str(random.randrange(100000000))
            local_db[s] += 1
        log("Created local DB for " + str(1000000 * i) + " records")
        insert_many(local_db)
        log("Finished inserting " + str(1000000 * i) + " records")

def create_connection():
    conn = None
    try:
        conn = sqlite3.connect('/home/testing.db')
    except sqlite3.Error as e:
        print(e)

    return conn

if __name__ == '__main__':
    create_table()
    main()

This runs great for a second, then slows down like crazy. Here's the output I just got:

2019-10-23 15:28:59.211036: Starting to process records
2019-10-23 15:29:01.308668: Created local DB for 1000000 records
2019-10-23 15:29:10.147762: Finished inserting 1000000 records
2019-10-23 15:29:12.258012: Created local DB for 2000000 records
2019-10-23 15:29:28.752352: Finished inserting 2000000 records
2019-10-23 15:29:30.853128: Created local DB for 3000000 records
2019-10-23 15:39:12.826357: Finished inserting 3000000 records
2019-10-23 15:39:14.932100: Created local DB for 4000000 records
2019-10-23 17:21:37.257651: Finished inserting 4000000 records
...

As you can see, the first million inserts take 9 seconds, then the next million take 16, then it balloons to 10 minutes, then an hour and 40 minutes (!). Is there something weird I'm doing that causes this crazy slowdown, or is this a limitation of sqlite?

user1902853
    well, you're using a text primary key.. – Marat Oct 23 '19 at 16:49
  • sqlite is not exactly the #1 best performance data base out there – zvone Oct 23 '19 at 16:54
  • @Marat Can you expand on that? Are text primary keys notably slower than ints? If so, how can I increase the performance for this specific problem? – user1902853 Oct 23 '19 at 17:10
  • @zvone Do you have suggestions for other DBs that would perform better for this problem? – user1902853 Oct 23 '19 at 17:10
  • I imagine most DBs are faster, but that is not the main problem you are having, it was just a comment to think of if you care for speed. There is a good answer from peak and a comment by marat which address the more important things. – zvone Oct 23 '19 at 17:55
  • what is the on conflict achieving that a non-unique index with a group by can’t do? or a subsequent insert select from a staging table to your actual table? i realize that this is an SO minimal sample, and that you may have your own reasons, preferences and logic, but is this index structure necessary at raw insert point? – JL Peyret Oct 24 '19 at 18:43
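
To make JL Peyret's staging-table suggestion concrete, a minimal sketch might look like the following. The staging table and the sample local_db are illustrative, not part of the original code, and the final upsert into testing still relies on the unique test column, so this only batches the conflict handling rather than removing the index cost.

import sqlite3

# Sketch of the staging-table approach: append raw rows to an unindexed table,
# then fold them into the indexed table with a single INSERT ... SELECT.
local_db = {"Testing insertion 42": 3, "Testing insertion 7": 1}  # stand-in data

conn = sqlite3.connect('/home/testing.db')
with conn:
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS testing (test text PRIMARY KEY, number integer);")
    cur.execute("CREATE TABLE IF NOT EXISTS staging (test text, number integer);")
    cur.executemany("INSERT INTO staging(test, number) VALUES (?, ?);",
                    local_db.items())
    # SQLite requires a WHERE clause here to disambiguate SELECT + ON CONFLICT.
    cur.execute("""
        INSERT INTO testing(test, number)
        SELECT test, SUM(number) FROM staging WHERE true GROUP BY test
        ON CONFLICT(test) DO UPDATE SET number = number + excluded.number;""")
    cur.execute("DELETE FROM staging;")
conn.close()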

2 Answers


Using your program with one (minor?) modification, I got the very reasonable-looking timings shown immediately below. The modification was to use sqlite3.connect instead of pysqlite.connect.

Timings using sqlite3.connect

The # comments are approximations.

2019-10-23 13:00:37.843759: Starting to process records
2019-10-23 13:00:40.253049: Created local DB for 1000000 records
2019-10-23 13:00:50.052383: Finished inserting 1000000 records          # 12s
2019-10-23 13:00:52.065007: Created local DB for 2000000 records
2019-10-23 13:01:08.069532: Finished inserting 2000000 records          # 18s
2019-10-23 13:01:10.073701: Created local DB for 3000000 records
2019-10-23 13:01:28.233935: Finished inserting 3000000 records          # 20s
2019-10-23 13:01:30.237968: Created local DB for 4000000 records
2019-10-23 13:01:51.052647: Finished inserting 4000000 records          # 23s
2019-10-23 13:01:53.079311: Created local DB for 5000000 records
2019-10-23 13:02:15.087708: Finished inserting 5000000 records          # 24s
2019-10-23 13:02:17.075652: Created local DB for 6000000 records
2019-10-23 13:02:41.710617: Finished inserting 6000000 records          # 26s
2019-10-23 13:02:43.712996: Created local DB for 7000000 records
2019-10-23 13:03:18.420790: Finished inserting 7000000 records          # 37s
2019-10-23 13:03:20.420485: Created local DB for 8000000 records
2019-10-23 13:04:03.287034: Finished inserting 8000000 records          # 45s
2019-10-23 13:04:05.593073: Created local DB for 9000000 records
2019-10-23 13:04:57.871396: Finished inserting 9000000 records          # 54s
2019-10-23 13:04:59.860289: Created local DB for 10000000 records
2019-10-23 13:05:54.527094: Finished inserting 10000000 records         # 57s
...

The costs of TEXT PRIMARY KEY

I believe the main reason for the slowdown is the definition of test as TEXT PRIMARY KEY. That entails huge indexing costs, as suggested by this snippet from a run in which the "PRIMARY KEY" and "ON CONFLICT" declarations were removed:

2019-10-23 13:26:22.627898: Created local DB for 10000000 records
2019-10-23 13:26:24.010171: Finished inserting 10000000 records
...
2019-10-23 13:26:58.350150: Created local DB for 20000000 records
2019-10-23 13:26:59.832137: Finished inserting 20000000 records

That's less than 1.4s at the 10-million-records mark, and not much more at the 20-million-records mark.
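
For reference, the stripped-down variant used for that run would look roughly like this. It is a sketch reconstructed from the description above: with no PRIMARY KEY and no ON CONFLICT, duplicate keys are simply appended, so per-key totals have to be aggregated at read time.

import sqlite3

# Sketch of the variant timed above: no PRIMARY KEY on test and no upsert clause.
def insert_many_plain(local_db, db_path='/home/testing.db'):
    conn = sqlite3.connect(db_path)
    with conn:
        cur = conn.cursor()
        cur.execute("CREATE TABLE IF NOT EXISTS testing (test text, number integer);")
        cur.executemany("INSERT INTO testing(test, number) VALUES (?, ?);",
                        local_db.items())
    conn.close()

# Per-key totals, when needed, can be recovered later with an aggregate
# (a full-table scan unless an index is created after the bulk load):
#   SELECT test, SUM(number) FROM testing GROUP BY test;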

peak
  • Okay, but I do need to insert or update (i.e. removing the `ON CONFLICT` declaration also stops my code from working). Is there a way to get the speed up while still checking for conflicts? – user1902853 Oct 23 '19 at 19:15
  • You might like to try APSW: https://rogerbinns.github.io/apsw/ – peak Oct 24 '19 at 08:11

(More of an extended comment than an answer)

SQLite only supports B-tree indexes. For strings, which can vary in length, the tree stores row ids. The read complexity of the tree is O(log n), where n is the number of rows in the table, but each step is multiplied by the cost of reading and comparing a string value from the table. So, unless there is a really good reason to do otherwise, it is better to have an integer field as the primary key.

What makes the situation worse in this case is that the strings you're inserting share a fairly long common prefix ("Testing insertion "), so every comparison has to scan past that prefix before it finds the first mismatching character.

Speedup suggestions, in order of expected effect size:

  • Real databases (MariaDB, Postgres) support hash indexes, which will solve this problem.
  • Disable autocommit and wrap each batch in one explicit transaction (unnecessary disk writes are very expensive); see the sketch after this list.
  • Reverse the text string (put the number before the fixed text), or store only the number part.
  • Use mass insertions (multiple records in one statement).
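
As promised above, here is a minimal sketch combining two of these suggestions: managing the transaction explicitly instead of relying on autocommit, and keying on an integer. The testing_int table name and the decision to drop the constant "Testing insertion " prefix are assumptions made for illustration, not part of the original code.

import sqlite3
import random
from collections import defaultdict

# Sketch: explicit transaction handling plus an integer primary key.
# isolation_level=None stops the sqlite3 module from opening implicit
# transactions, so we issue BEGIN/COMMIT ourselves around the whole batch.
conn = sqlite3.connect('/home/testing.db', isolation_level=None)
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS testing_int (test_id integer PRIMARY KEY, number integer);")

local_db = defaultdict(int)
for _ in range(1000000):
    local_db[random.randrange(100000000)] += 1   # integer keys instead of strings

cur.execute("BEGIN")
cur.executemany(
    "INSERT INTO testing_int(test_id, number) VALUES (?, ?) "
    "ON CONFLICT(test_id) DO UPDATE SET number = number + excluded.number;",
    local_db.items())
cur.execute("COMMIT")
conn.close()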

@peak's answer avoids the whole problem by not using an index at all. If an index is not required, that is definitely the way to go.

Marat