
The (regrettably lengthy) MWE at the end of this question is cut down from a real application. It is supposed to work like this: There are two tables. One includes both already-processed and not-yet-processed data; the other has the results of processing the data. On startup, we create a temporary table that lists all of the data that has not yet been processed. We then open a read cursor on that table and scan it from beginning to end; for each datum, we do some crunching (omitted in the MWE) and then insert the results into the processed-data table, using a separate cursor.

This works correctly in autocommit mode. However, if the write operation is wrapped in a transaction -- and in the real application it has to be, because the write actually touches several tables (all but one of which have been omitted from the MWE) -- then the COMMIT has the side effect of resetting the read cursor on the temp table, causing rows that have already been processed to be reprocessed. This not only prevents forward progress, it crashes the program with an IntegrityError on the attempt to insert a duplicate row into data_out. If you run the MWE you should see this output:

0
1
2
3
4
5
6
7
8
9
10
0
---
127 rows remaining
Traceback (most recent call last):
  File "sqlite-test.py", line 85, in <module>
    test_main()
  File "sqlite-test.py", line 83, in test_main
    test_run(db)
  File "sqlite-test.py", line 71, in test_run
    (row[0], b"output"))
sqlite3.IntegrityError: UNIQUE constraint failed: data_out.value

What can I do to prevent the read cursor from being reset by a COMMIT touching unrelated tables?

Notes:

- All of the INTEGERs in the schema are ID numbers; in the real application there are several more ancillary tables that hold more information for each ID, and the write transaction touches two or three of them in addition to data_out, depending on the result of the computation.
- In the real application, the temporary "data_todo" table is potentially very large -- millions of rows; I started down this road precisely because a Python list was too big to fit in memory.
- The MWE's shebang is for python3 but it will behave exactly the same under python2 (provided the interpreter is new enough to understand b"..." strings).
- Setting PRAGMA locking_mode = EXCLUSIVE; and/or PRAGMA journal_mode = WAL; has no effect on the phenomenon.
- I am using SQLite 3.8.2.

#! /usr/bin/python3

import contextlib
import sqlite3
import sys
import tempfile
import textwrap

def init_db(db):
    db.executescript(textwrap.dedent("""\
        CREATE TABLE data_in (
            origin    INTEGER,
            origin_id INTEGER,
            value     INTEGER,
            UNIQUE(origin, origin_id)
        );
        CREATE TABLE data_out (
            value     INTEGER PRIMARY KEY,
            processed BLOB
        );
        """))

    db.executemany("INSERT INTO data_in VALUES(?, ?, ?);",
                   [ (1, x, x) for x in range(100) ])
    db.executemany("INSERT INTO data_in VALUES(?, ?, ?);",
                   [ (2, x, 200 - x*2) for x in range(100) ])

    db.executemany("INSERT INTO data_out VALUES(?, ?);",
                   [ (x, b"already done") for x in range(50, 130, 5) ])

    # Enumerate the values that still have to be processed.
    db.execute(textwrap.dedent("""\
        CREATE TEMPORARY TABLE data_todo AS
            SELECT DISTINCT value FROM data_in
            WHERE value NOT IN (SELECT value FROM data_out)
            ORDER BY value;
        """))

def test_run(db):
    init_db(db)

    read_cur  = db.cursor()
    write_cur = db.cursor()

    read_cur.arraysize = 10
    read_cur.execute("SELECT * FROM data_todo;")

    try:
        while True:
            block = read_cur.fetchmany()
            if not block: break
            for row in block:
                # (in real life, data actually crunched here)
                sys.stdout.write("{}\n".format(row[0]))
                write_cur.execute("BEGIN TRANSACTION;")
                # (in real life, several more inserts here)
                write_cur.execute("INSERT INTO data_out VALUES(?, ?);",
                                  (row[0], b"output"))
                db.commit()  # this COMMIT resets read_cur's scan of data_todo

    finally:
        read_cur.execute("SELECT COUNT(DISTINCT value) FROM data_in "
                         "WHERE value NOT IN (SELECT value FROM data_out)")
        result = read_cur.fetchone()
        sys.stderr.write("---\n{} rows remaining\n".format(result[0]))

def test_main():
    with tempfile.NamedTemporaryFile(suffix=".db") as tmp:
        with contextlib.closing(sqlite3.connect(tmp.name)) as db:
            test_run(db)

test_main()
zwol
  • Use two connections, perhaps? – Martijn Pieters Feb 24 '14 at 18:07
  • @MartijnPieters Two connections works! Make that an answer and I'll accept it. – zwol Feb 24 '14 at 18:17
  • @Zack What does MWE stand for? I understand you are referring to the code snippets, I was just curious what it stands for though. – Cam Feb 24 '14 at 18:26
  • @Cam Minimized Working Example. Local jargon for a program that has just enough code to demonstrate the problem you want help with, but no more. MWEs often do things that look silly -- for instance, the main loop of this program could be replaced with `INSERT OR IGNORE INTO data_out SELECT DISTINCT value, CAST('output' AS BLOB) FROM data_in` -- and so it's important to be clear about whether code in a question is an MWE rather than a complete program, and where one has cut out whole blocks of code because they aren't relevant to the question. – zwol Feb 24 '14 at 18:45

1 Answer


Use a second, separate connection for the temporary table; it'll be unaffected by commits on the other connection.
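Neither the question nor this answer posted code for the two-connection version, so here is a minimal sketch of what it might look like, reusing the schema and table names from the MWE (the hypothetical process_all() stands in for the MWE's test_run(); the WAL pragma is my addition, not something the thread confirms is required -- in rollback-journal mode the writer's commit can fail with "database is locked" while a reader is active, whereas WAL lets one writer coexist with readers):

import sqlite3

def process_all(path):
    read_db  = sqlite3.connect(path)   # owns the todo list and the scan
    write_db = sqlite3.connect(path)   # owns the write transactions

    # WAL lets write_db commit while read_db's cursor is still open.
    write_db.execute("PRAGMA journal_mode=WAL;")

    # Temporary tables are private to the connection that creates them,
    # so data_todo must be built on the connection that will scan it.
    read_db.execute("""
        CREATE TEMPORARY TABLE data_todo AS
            SELECT DISTINCT value FROM data_in
            WHERE value NOT IN (SELECT value FROM data_out)
            ORDER BY value;""")

    read_cur = read_db.cursor()
    read_cur.execute("SELECT value FROM data_todo;")
    for (value,) in read_cur:
        # (data crunching would happen here)
        with write_db:   # one transaction per datum, committed on exit
            write_db.execute("INSERT INTO data_out VALUES(?, ?);",
                             (value, b"output"))

    read_db.close()
    write_db.close()

Commits on write_db no longer reset read_cur, because the two connections have independent transaction state; the temp table lives entirely on read_db's side.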

Martijn Pieters
  • won't this make the commits run into a foreign lock while the select cursor is not yet exhausted: `sqlite3.OperationalError: database is locked`? – flaschbier Nov 08 '15 at 18:31
  • @flaschbier: the separate connection only *reads*, which should be fine. Not that I have implemented this myself; the OP reported this as working. – Martijn Pieters Nov 08 '15 at 18:54
  • Actually I did, and it failed. I will follow up on this because I am facing a similar issue and will report my findings. Thanks anyway for pointing me to that. – flaschbier Nov 08 '15 at 19:02