The (regrettably lengthy) MWE at the end of this question is cut down from a real application. It is supposed to work like this: there are two tables. One holds both already-processed and not-yet-processed data; the other holds the results of processing. On startup, we create a temporary table listing all of the data that has not yet been processed. We then open a read cursor on that table and scan it from beginning to end; for each datum, we do some crunching (omitted in the MWE) and then insert the results into the processed-data table, using a separate cursor.
This works correctly in autocommit mode. However, if the write operation is wrapped in a transaction -- and in the real application it has to be, because the write actually touches several tables (all but one of which have been omitted from the MWE) -- then the COMMIT has the side effect of resetting the read cursor on the temp table, so rows that have already been processed get reprocessed. Not only does this prevent forward progress, it makes the program crash with an IntegrityError when it tries to insert a duplicate row into data_out. If you run the MWE you should see this output:
0
1
2
3
4
5
6
7
8
9
10
0
---
127 rows remaining
Traceback (most recent call last):
  File "sqlite-test.py", line 85, in <module>
    test_main()
  File "sqlite-test.py", line 83, in test_main
    test_run(db)
  File "sqlite-test.py", line 71, in test_run
    (row[0], b"output"))
sqlite3.IntegrityError: UNIQUE constraint failed: data_out.value
What can I do to prevent the read cursor from being reset by a COMMIT touching unrelated tables?
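In condensed form (a simplification of the full MWE below, using the same names), the problematic pattern is just this:

read_cur.execute("SELECT * FROM data_todo;")   # long-lived read cursor
while True:
    block = read_cur.fetchmany()
    if not block:
        break
    for row in block:
        write_cur.execute("BEGIN TRANSACTION;")
        write_cur.execute("INSERT INTO data_out VALUES(?, ?);",
                          (row[0], b"output"))
        db.commit()   # this commit restarts read_cur's scan of data_todo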
Notes:

- All of the INTEGERs in the schema are ID numbers; in the real application there are several more ancillary tables that hold additional information for each ID, and the write transaction touches two or three of them in addition to data_out, depending on the result of the computation.
- In the real application, the temporary "data_todo" table is potentially very large -- millions of rows. I started down this road precisely because a Python list was too big to fit in memory.
- The MWE's shebang is for python3, but it behaves exactly the same under python2 (provided the interpreter is new enough to understand b"..." strings).
- Setting PRAGMA locking_mode = EXCLUSIVE; and/or PRAGMA journal_mode = WAL; has no effect on the phenomenon (see the snippet just after these notes).
- I am using SQLite 3.8.2.
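(For reference, "setting" those pragmas means executing them on the connection before anything else, e.g.:

db.execute("PRAGMA locking_mode = EXCLUSIVE;")   # tried alone and together
db.execute("PRAGMA journal_mode = WAL;")         # tried alone and together

Neither changes the output shown above. They are not part of the MWE, which follows.)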
#! /usr/bin/python3

import contextlib
import sqlite3
import sys
import tempfile
import textwrap

def init_db(db):
    db.executescript(textwrap.dedent("""\
        CREATE TABLE data_in (
            origin INTEGER,
            origin_id INTEGER,
            value INTEGER,
            UNIQUE(origin, origin_id)
        );
        CREATE TABLE data_out (
            value INTEGER PRIMARY KEY,
            processed BLOB
        );
    """))
    db.executemany("INSERT INTO data_in VALUES(?, ?, ?);",
                   [ (1, x, x) for x in range(100) ])
    db.executemany("INSERT INTO data_in VALUES(?, ?, ?);",
                   [ (2, x, 200 - x*2) for x in range(100) ])
    db.executemany("INSERT INTO data_out VALUES(?, ?);",
                   [ (x, b"already done") for x in range(50, 130, 5) ])

    db.execute(textwrap.dedent("""\
        CREATE TEMPORARY TABLE data_todo AS
        SELECT DISTINCT value FROM data_in
        WHERE value NOT IN (SELECT value FROM data_out)
        ORDER BY value;
    """))

def test_run(db):
    init_db(db)
    read_cur = db.cursor()
    write_cur = db.cursor()
    read_cur.arraysize = 10
    read_cur.execute("SELECT * FROM data_todo;")
    try:
        while True:
            block = read_cur.fetchmany()
            if not block: break
            for row in block:
                # (in real life, data actually crunched here)
                sys.stdout.write("{}\n".format(row[0]))
                write_cur.execute("BEGIN TRANSACTION;")
                # (in real life, several more inserts here)
                write_cur.execute("INSERT INTO data_out VALUES(?, ?);",
                                  (row[0], b"output"))
                db.commit()
    finally:
        read_cur.execute("SELECT COUNT(DISTINCT value) FROM data_in "
                         "WHERE value NOT IN (SELECT value FROM data_out)")
        result = read_cur.fetchone()
        sys.stderr.write("---\n{} rows remaining\n".format(result[0]))

def test_main():
    with tempfile.NamedTemporaryFile(suffix=".db") as tmp:
        with contextlib.closing(sqlite3.connect(tmp.name)) as db:
            test_run(db)

test_main()