4

My team is struggling with a strange Microsoft SQL Server issue regarding concurrent threads/processes. Very rarely, there seems to be a race condition between an UPDATE and a simultaneous SELECT on two different threads. It is difficult to reproduce because the timing is critical.

When the conflict happens, the SELECT does not return the desired record, although no errors are generated. Instead the SELECT behaves as if the record doesn't exist, which is not true. As a result our service becomes confused, and there is some chaos until we can recover.

To eliminate the possibility of bugs in our service, I reproduced the issue using plain C# code and System.Data.SqlClient. I'll gladly provide that code upon request. I have tried to reduce the code and the table to the minimum required to reproduce the issue.

A test table is created like this:

CREATE TABLE [TestTable] 
(
    [RecordId] int IDENTITY(1,1) UNIQUE,
    [Locked] int NOT NULL DEFAULT 0,
    [Priority] int NULL,
    [Status] int NULL,
    [SystemName] nvarchar(50) NULL
)

The table has an index created like this:

CREATE NONCLUSTERED INDEX [IDX_GENERAL] 
ON [TestTable] ([Status], [SystemName], [Locked]) 
INCLUDE ([RecordId], [Priority])

The table contains a single record created like this:

INSERT INTO [TestTable] ([Locked], [Priority], [Status], [SystemName])
VALUES (0, 3, 3000, 'System1')

In our service, multiple threads and processes acquire exclusive write access to records by atomically setting their 'Locked' field to one. Then after making any necessary modifications, a thread/process must release the record by resetting its 'Locked' column to zero. What I see is an extremely rare situation when one thread unlocks a record, and simultaneously another thread tries to find it:

One thread performs an UPDATE like this (which succeeds):

UPDATE [TestTable] 
SET [Locked] = 0 
OUTPUT INSERTED.* 
WHERE [Locked] = 1 
  AND [RecordId] = 1 
  AND [Status] = 3000 
  AND [SystemName] = 'System1'

At the same moment, another thread performs a SELECT like this (which comes back empty):

SELECT [RecordId] 
FROM [TestTable] 
WHERE [Status] = 3000 
  AND [SystemName] = 'System1' 
ORDER BY Priority DESC, RecordId ASC

I believe the index is part of the problem, because if I remove the Status or SystemName keys, then I can no longer reproduce the issue. I have no clue what would cause this behavior. Everything I have read says that this simply cannot happen.

I would welcome any questions, ideas, or suggestions about how to troubleshoot this...

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Mike
  • 49
  • 2
  • Can you add `WITH NO LOCK` to the Select Query..? – MethodMan Oct 19 '17 at 20:44
  • Thank you for your response. I have tried 'WITH (NOLOCK)' on the SELECT, but it doesn't make a difference. – Mike Oct 19 '17 at 20:53
  • Is your table a heap? e.g. `Id` is `unique` and not `primary key`, no clustered index specified. – SqlZim Oct 19 '17 at 20:58
  • I would suggest not to use direct select/update queries in this case, but rather use stored procedures instead. And in stored procedures handle checking for [Locked] behavior. – GSazheniuk Oct 19 '17 at 20:58
  • For `SELECT [RecordId] FROM [TestTable]` what isolation level is the query running at? Is Snapshot Isolation also enabled? – mjwills Oct 19 '17 at 21:28
  • The database has: SET ALLOW_SNAPSHOT_ISOLATION OFF and SET READ_COMMITTED_SNAPSHOT OFF. Unfortunately the real database spec for our application is about 20 years old, and a LOT of our processes use it, so I don't want to risk changing the database itself. The table can be changed easily. Still I want to understand why this happens, in case our code is doing something wrong. – Mike Oct 20 '17 at 16:13
  • The RecordId is UNIQUE, but not a PRIMARY KEY. I have no clustered index, but SQL automatically created a non-clustered index for it. The SELECT statement is running at the default isolation level, which as I understand it is READ COMMITTED. – Mike Oct 20 '17 at 16:14
  • I uploaded the source code of my test application here: https://app.box.com/s/gr6ux8w8sogivrqcda2zpr72rpy7ys9e – Mike Oct 20 '17 at 18:29
  • Inside the upload there are two .sql scripts for creating the database and tables. In the application you will need to modify the s_Connect string at the top for your environment. Sometimes the conflict happens quickly, other times it can take several minutes or more. The output will say "record does not exist!" which is clearly not true. – Mike Oct 20 '17 at 18:35
  • Possible duplicate of [Spring Data/Hibernate MS SQL Server unique constraint and race condition](https://stackoverflow.com/questions/46618422/spring-data-hibernate-ms-sql-server-unique-constraint-and-race-condition) – Alex Oct 21 '17 at 02:47
  • See my answer is the above link, as well as linked article. – Alex Oct 21 '17 at 02:49
  • Thank you, I read through the links you provided, but I'm not seeing the relevance to my issue. In particular, I don't see what I would need to change in the code or the table to ensure that the SELECT always returns the record. It does return the record 99.999% of the time, but very rarely it does not. – Mike Oct 23 '17 at 16:46
  • I tested using stored procedures and it didn't make a difference. However, I may have stumbled upon another solution. I moved the 'Locked' value into the INCLUDE section of the index, and ever since then I have not been able to reproduce the issue. I am happy but mystified by this. Can anyone offer an explanation? – Mike Oct 30 '17 at 19:53
  • Any reason why you aren't using transactions? – Jason Geiger Nov 22 '17 at 19:32

1 Answers1

0

There is a flaw in your logic. At start many threads try to get id of the record for lock.

 SELECT [RecordId] 
 FROM [TestTable] 
 WHERE [Locked]=0 AND [Status]=3000 AND [SystemName]='System1' 
 ORDER BY Priority DESC, RecordId ASC

Then they try to set lock

UPDATE [TestTable] 
SET [Locked]=1 
OUTPUT INSERTED.* 
WHERE [Locked]=0 AND [RecordId]={0} 
      AND [Status]=3000 AND [SystemName]='System1'

Naturally it can happen that

  1. The first thread get ID of unlocked record
  2. The second thread get ID of unlocked record
  3. The second thread tries to lock record and succeed
  4. The first thread tries to lock record and fails

It's a race condition bug.

A race condition is a bug that occurs when the outcome of a program depends on which of two or more threads reaches a particular block of code first. Running the program many times produces different results, and the result of any given run cannot be predicted.

In your case it's not a bug but some rare case for your logic.
You have two options

  1. Open common transaction for both queries. (Say hello to DB locks)
  2. Take into account that on execution second query record can be already locked by another thread. (I would choose this option)
finesoul
  • 151
  • 2
  • 6