
I have some code that writes Scrapy scraped data to a SQL Server db. The data items consist of some basic hotel data (name, address, rating, ...) and a list of rooms with associated data (price, occupancy, etc.). There can be multiple celery threads and multiple servers running this code, simultaneously writing different items to the db. I am encountering deadlock errors like:

[Failure instance: Traceback: <class 'pyodbc.ProgrammingError'>: 
('42000', '[42000] [FreeTDS][SQL Server]Transaction (Process ID 62)
 was deadlocked on lock resources with another process and has been 
chosen as the deadlock victim. Rerun the transaction. (1205) (SQLParamData)')

The code that actually does the insert/update schematically looks like this:

1) Check if the hotel exists in the hotels table; if it does, update it, otherwise insert it. 
   Get the hotel id either way. This is done by `curs.execute(...)`

2) Python loop over the scraped hotel rooms. For each room, check if the room exists 
   in the rooms table (which is foreign keyed to the hotels table). 
   If not, insert it, using the hotel id to reference the hotels table row; 
   else update it. These upserts are also done using `curs.execute(...)` (sketched just after this list).
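
A rough sketch of that flow, with hypothetical table and column names (the real schema is more involved), might look like:

# Rough sketch of the per-statement flow above; table/column names
# (hotels, rooms, ...) are illustrative, not the real schema.
def upsert_hotel_and_rooms(curs, hotel, rooms):
    # Step 1: upsert the hotel and get its id either way.
    curs.execute("SELECT id FROM hotels WHERE name = ?", hotel["name"])
    row = curs.fetchone()
    if row:
        hotel_id = row[0]
        curs.execute("UPDATE hotels SET address = ?, rating = ? WHERE id = ?",
                     hotel["address"], hotel["rating"], hotel_id)
    else:
        curs.execute("INSERT INTO hotels (name, address, rating) VALUES (?, ?, ?)",
                     hotel["name"], hotel["address"], hotel["rating"])
        # @@IDENTITY survives across execute() calls on the same connection;
        # SCOPE_IDENTITY() would not, since each execute() is its own batch.
        hotel_id = curs.execute("SELECT @@IDENTITY").fetchone()[0]

    # Step 2: loop over the rooms, upserting each against hotel_id.
    for room in rooms:
        curs.execute("SELECT 1 FROM rooms WHERE hotel_id = ? AND room_type = ?",
                     hotel_id, room["room_type"])
        if curs.fetchone():
            curs.execute("UPDATE rooms SET price = ?, occupancy = ? "
                         "WHERE hotel_id = ? AND room_type = ?",
                         room["price"], room["occupancy"],
                         hotel_id, room["room_type"])
        else:
            curs.execute("INSERT INTO rooms (hotel_id, room_type, price, occupancy) "
                         "VALUES (?, ?, ?, ?)", hotel_id, room["room_type"],
                         room["price"], room["occupancy"])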

It is a bit more complicated than this in practice, but this illustrates that the Python code is issuing multiple `curs.execute` calls before and during the loop.

If, instead of upserting the data in the above manner, I generate one big SQL command that does the same thing (check for the hotel, upsert it, record the id in a temporary variable, then for each room check whether it exists and upsert it against the hotel id variable, etc.) and issue only a single curs.execute(...) from the Python code, then I no longer see deadlock errors.
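
Schematically, the single-batch version looks something like this (again a sketch, with made-up table names and a simplified parameter layout):

# Sketch of the single-batch approach: one string of T-SQL, one execute().
sql = """
DECLARE @hotel_id INT;

IF EXISTS (SELECT 1 FROM hotels WHERE name = ?)
BEGIN
    UPDATE hotels SET address = ?, rating = ? WHERE name = ?;
    SELECT @hotel_id = id FROM hotels WHERE name = ?;
END
ELSE
BEGIN
    INSERT INTO hotels (name, address, rating) VALUES (?, ?, ?);
    SET @hotel_id = SCOPE_IDENTITY();
END

-- ...then one IF EXISTS / UPDATE / ELSE INSERT block per room,
-- each referencing @hotel_id...
"""
curs.execute(sql, hotel["name"],
             hotel["address"], hotel["rating"], hotel["name"], hotel["name"],
             hotel["name"], hotel["address"], hotel["rating"])
conn.commit()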

However, I don't really understand why this makes a difference, and I'm also not entirely sure it is safe to run big SQL blocks with multiple SELECTs, INSERTs and UPDATEs in a single pyodbc curs.execute. As I understand it, pyodbc is supposed to handle only single statements; however, it does seem to work, and I see my tables populated with no deadlock errors.

Nevertheless, it seems impossible to get any output when I run a big command like this. I tried declaring a variable @output_string and recording various things to it (for example, whether we had to insert or update the hotel) before finally running SELECT @output_string AS outputstring, but doing a fetch after the execute in pyodbc always fails with

<class 'pyodbc.ProgrammingError'>: No results.  Previous SQL was not a query.

Experiments within the shell suggest pyodbc only returns results from the first statement:

In [11]: curs.execute("SELECT 'HELLO'; SELECT 'BYE';")
Out[11]: <pyodbc.Cursor at 0x7fc52c044a50>

In [12]: curs.fetchall()
Out[12]: [('HELLO', )]

So if the first statement is not a query, you get that error:

In [13]: curs.execute("PRINT 'HELLO'; SELECT 'BYE';")
Out[13]: <pyodbc.Cursor at 0x7fc52c044a50>

In [14]: curs.fetchall()
---------------------------------------------------------------------------
ProgrammingError                          Traceback (most recent call last)
<ipython-input-14-ad813e4432e9> in <module>()
----> 1 curs.fetchall()

ProgrammingError: No results.  Previous SQL was not a query.

Nevertheless, except for the inability to fetch my @output_string, my real "big query", consisting of multiple SELECTs, UPDATEs and INSERTs, actually works and populates multiple tables in the db.

Similarly, if I try something like

curs.execute("INSERT INTO testX (entid, thecol) VALUES (4, 5); "
             "INSERT INTO testX (entid, thecol) VALUES (5, 6); "
             "SELECT * FROM testX;")

I see that both rows were inserted into the table testX, even though a subsequent curs.fetchall() fails with the "Previous SQL was not a query" error. So it seems that pyodbc's execute does execute everything, not just the first statement.

If I can trust this, then my main problem is how to get some output for logging.
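
One plausible way to get that output back, assuming the culprit is the row-count messages that each INSERT/UPDATE in the batch emits ahead of the final SELECT (which is what the SET NOCOUNT ON comment below is hinting at): prefix the batch with SET NOCOUNT ON so the final SELECT becomes the first result set, or step through the result sets with pyodbc's cursor.nextset(). A sketch:

# Sketch: SET NOCOUNT ON suppresses the "N rows affected" messages that
# otherwise arrive ahead of the final SELECT in a multi-statement batch.
sql = """
SET NOCOUNT ON;
DECLARE @output_string NVARCHAR(MAX) = N'';
-- ... upsert work here, appending notes to @output_string ...
SELECT @output_string AS outputstring;
"""
curs.execute(sql)
row = curs.fetchone()           # the SELECT is now the first result set
print(row.outputstring)

# Alternative (without SET NOCOUNT ON): advance past the non-query
# results until one with columns appears, then fetch:
#     while curs.description is None and curs.nextset():
#         pass
#     row = curs.fetchone()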

EDIT: Setting autocommit=True in the dbargs seems to prevent the deadlock errors, even with the multiple curs.execute calls. But why does this fix it?

fpghost
  • Does your "big SQL command" (presumably containing multiple statements) begin with `SET NOCOUNT ON;`? – Gord Thompson Jan 27 '17 at 19:54
  • @GordThompson No, it begins with "BEGIN" and ends with "END;"; within it are multiple other blocks with "BEGIN ... END;", "IF/ELSE", "SELECT"s, "INSERT"s and "UPDATE"s. – fpghost Jan 27 '17 at 19:55

1 Answer


Setting autocommit=True in the dbargs seems to prevent the deadlock errors, even with the multiple curs.execute calls. But why does this fix it?

When establishing a connection, pyodbc defaults to autocommit=False in accordance with the Python DB-API spec. Therefore when the first SQL statement is executed, ODBC begins a database transaction that remains in effect until the Python code does a .commit() or a .rollback() on the connection.

The default transaction isolation level in SQL Server is "Read Committed". Unless the database is configured to support SNAPSHOT isolation by default, a write operation within a transaction under Read Committed isolation will place transaction-scoped locks on the rows that were updated. Under conditions of high concurrency, deadlocks can occur if multiple processes generate conflicting locks. If those processes use long-lived transactions that generate a large number of such locks, then the chances of a deadlock are greater.

Setting autocommit=True will avoid the deadlocks because each individual SQL statement will be automatically committed, thus ending the transaction (which was automatically started when that statement began executing) and releasing any locks on the updated rows.
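
For example (the DSN and credentials here are placeholders):

import pyodbc

# autocommit=True: each execute() commits immediately, so row locks are
# released as soon as each statement finishes instead of accumulating
# until an explicit conn.commit().
conn = pyodbc.connect(
    "DRIVER={FreeTDS};SERVER=example;PORT=1433;DATABASE=hotelsdb;"
    "UID=scraper;PWD=secret;TDS_Version=7.2;",
    autocommit=True,
)
curs = conn.cursor()
curs.execute("UPDATE hotels SET rating = ? WHERE id = ?", 4.5, 42)
# no conn.commit() needed; the UPDATE is already committed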

So, to help avoid deadlocks you can consider a few different strategies:

  • continue to use autocommit=True, or
  • have your Python code explicitly .commit() more often, or
  • use SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED to "loosen up" the transaction isolation level, so that reads take no shared locks and are not blocked by the locks that write operations leave in place, or
  • configure the database to use SNAPSHOT isolation, which will avoid lock contention but will make SQL Server work harder (both of these last two options are sketched after this list).
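
A sketch of those last two options (the ALTER DATABASE statements require appropriate permissions and are shown only for illustration; the database name is a placeholder):

# Per connection: loosen the isolation level before doing the upserts.
# Reads then take no shared locks, at the cost of possibly seeing
# uncommitted data.
curs.execute("SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;")

# Or, once, at the database level: use row versioning so readers are
# never blocked by writers (run by a DBA; 'hotelsdb' is a placeholder):
#   ALTER DATABASE hotelsdb SET ALLOW_SNAPSHOT_ISOLATION ON;
#   ALTER DATABASE hotelsdb SET READ_COMMITTED_SNAPSHOT ON;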

You will need to do some homework to determine the best strategy for your particular usage case.

Gord Thompson