I have some code that writes Scrapy-scraped data to a SQL Server database. The data items consist of some basic hotel data (name, address, rating, ...) and a list of rooms with associated data (price, occupancy, etc.). There can be multiple Celery threads and multiple servers running this code, simultaneously writing different items to the db. I am encountering deadlock errors like:
```
[Failure instance: Traceback: <class 'pyodbc.ProgrammingError'>:
('42000', '[42000] [FreeTDS][SQL Server]Transaction (Process ID 62)
was deadlocked on lock resources with another process and has been
chosen as the deadlock victim. Rerun the transaction. (1205) (SQLParamData)')
```
The code that actually does the insert/update schematically looks like this:
1) Check whether the hotel exists in the hotels table; if it does, update it, otherwise insert it. Get the hotel id either way. This is done with `curs.execute(...)`.

2) Loop in Python over the scraped hotel rooms. For each room, check whether it exists in the rooms table (which has a foreign key to the hotels table). If not, insert it, using the hotel id to reference the hotels table row; otherwise update it. These upserts are also done with `curs.execute(...)`.

It is a bit more complicated than this in practice, but it illustrates that the Python code issues multiple `curs.execute` calls before and during the loop (roughly as sketched below).
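For concreteness, a minimal sketch of that flow, assuming made-up table and column names (hotels, rooms, etc.) rather than my real schema:

```python
# Minimal sketch of the multi-execute flow; table/column names are illustrative.
def upsert_hotel_and_rooms(conn, hotel, rooms):
    curs = conn.cursor()

    # 1) Upsert the hotel and get its id.
    curs.execute("SELECT id FROM hotels WHERE name = ?", hotel["name"])
    row = curs.fetchone()
    if row:
        hotel_id = row[0]
        curs.execute("UPDATE hotels SET address = ?, rating = ? WHERE id = ?",
                     hotel["address"], hotel["rating"], hotel_id)
    else:
        curs.execute("INSERT INTO hotels (name, address, rating) VALUES (?, ?, ?)",
                     hotel["name"], hotel["address"], hotel["rating"])
        hotel_id = curs.execute("SELECT @@IDENTITY").fetchone()[0]

    # 2) Upsert each room against that hotel id.
    for room in rooms:
        curs.execute("SELECT id FROM rooms WHERE hotel_id = ? AND room_type = ?",
                     hotel_id, room["room_type"])
        row = curs.fetchone()
        if row:
            curs.execute("UPDATE rooms SET price = ?, occupancy = ? WHERE id = ?",
                         room["price"], room["occupancy"], row[0])
        else:
            curs.execute("INSERT INTO rooms (hotel_id, room_type, price, occupancy) "
                         "VALUES (?, ?, ?, ?)",
                         hotel_id, room["room_type"], room["price"], room["occupancy"])

    conn.commit()
```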
If, instead of upserting the data in the above manner, I generate one big SQL command that does the same thing (checks for the hotel, upserts it, records the id in a local variable, then for each room checks whether it exists and upserts it against that hotel id variable, etc.) and issue only a single `curs.execute(...)` from the Python code, then I no longer see deadlock errors.
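Schematically, the single-batch version looks something like this (again with placeholder schema; in reality the batch is generated per item, with one IF/ELSE block appended per scraped room):

```python
# Sketch of the single-batch variant: one generated T-SQL string, one execute.
# Table/column names are placeholders; `curs` is the pyodbc cursor as above,
# and `params` is a flat list of values matching the ? placeholders in order.
big_sql = """
DECLARE @hotel_id INT;

IF EXISTS (SELECT 1 FROM hotels WHERE name = ?)
BEGIN
    UPDATE hotels SET address = ?, rating = ? WHERE name = ?;
    SELECT @hotel_id = id FROM hotels WHERE name = ?;
END
ELSE
BEGIN
    INSERT INTO hotels (name, address, rating) VALUES (?, ?, ?);
    SET @hotel_id = SCOPE_IDENTITY();
END

-- one block like this per scraped room, appended by the Python code
IF EXISTS (SELECT 1 FROM rooms WHERE hotel_id = @hotel_id AND room_type = ?)
BEGIN
    UPDATE rooms SET price = ?, occupancy = ?
    WHERE hotel_id = @hotel_id AND room_type = ?;
END
ELSE
BEGIN
    INSERT INTO rooms (hotel_id, room_type, price, occupancy)
    VALUES (@hotel_id, ?, ?, ?);
END
"""
curs.execute(big_sql, params)
conn.commit()
```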
However, I don't really understand why this makes a difference, and I'm also not entirely sure it is safe to run big SQL blocks with multiple SELECTs, INSERTs and UPDATEs in a single pyodbc `curs.execute`. As I understand it, pyodbc is only supposed to handle single statements; however it does seem to work, and I see my tables populate with no deadlock errors.
Nevertheless, it seems impossible to get any output back when I run a big command like this. I tried declaring a variable `@output_string` and recording various things in it (did we have to insert or update the hotel, for example) before finally doing `SELECT @output_string AS outputstring`, but doing a fetch after the execute in pyodbc always fails with:

```
<class 'pyodbc.ProgrammingError'>: No results. Previous SQL was not a query.
```
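The attempt looked roughly like this (the upsert logic is elided; `curs` is the pyodbc cursor and `@output_string` is the variable I was hoping to read back):

```python
# Sketch of the failed attempt to get some output back from the big batch.
curs.execute("""
DECLARE @output_string NVARCHAR(MAX) = '';

-- ... the hotel/room upsert logic from above, appending notes as it goes,
-- e.g. SET @output_string = @output_string + 'hotel updated; ';

SELECT @output_string AS outputstring;
""")
print(curs.fetchall())  # ProgrammingError: No results. Previous SQL was not a query.
```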
Experiments within the shell suggest pyodbc ignores everything after the first statement:
```
In [11]: curs.execute("SELECT 'HELLO'; SELECT 'BYE';")
Out[11]: <pyodbc.Cursor at 0x7fc52c044a50>

In [12]: curs.fetchall()
Out[12]: [('HELLO', )]
```
So if the first statement is not a query, you get that error:
```
In [13]: curs.execute("PRINT 'HELLO'; SELECT 'BYE';")
Out[13]: <pyodbc.Cursor at 0x7fc52c044a50>

In [14]: curs.fetchall()
---------------------------------------------------------------------------
ProgrammingError                          Traceback (most recent call last)
<ipython-input-14-ad813e4432e9> in <module>()
----> 1 curs.fetchall()

ProgrammingError: No results. Previous SQL was not a query.
```
Nevertheless, except for the inability to fetch my `@output_string`, my real "big query", consisting of multiple SELECTs, UPDATEs and INSERTs, actually works and populates multiple tables in the db.
Moreover, if I try something like

```
curs.execute('INSERT INTO testX (entid, thecol) VALUES (4, 5); '
             'INSERT INTO testX (entid, thecol) VALUES (5, 6); '
             'SELECT * FROM testX;')
```

I see that both rows were inserted into the table testX, even though a subsequent `curs.fetchall()` fails with the "Previous SQL was not a query." error. So it seems that pyodbc's execute really does execute everything, not just the first statement.
If I can trust this, then my main problem is how to get some output for logging.
EDIT: Setting `autocommit=True` in the db args seems to prevent the deadlock errors, even with the multiple `curs.execute` calls. But why does this fix it?
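For reference, this is what I mean, assuming a plain pyodbc connection (the connection string details are placeholders for my real db args):

```python
import pyodbc

# Open the connection with autocommit=True; with this set, each statement is
# committed as it executes rather than inside an explicit transaction.
conn = pyodbc.connect(
    "DRIVER={FreeTDS};SERVER=myserver;PORT=1433;DATABASE=hotels;UID=user;PWD=secret",
    autocommit=True,
)
curs = conn.cursor()
```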