My team is struggling with a strange Microsoft SQL Server issue regarding concurrent threads/processes. Very rarely, there seems to be a race condition between an UPDATE and a simultaneous SELECT
on two different threads. It is difficult to reproduce because the timing is critical.
When the conflict happens, the SELECT
does not return the desired record, although no errors are generated. Instead the SELECT
behaves as if the record doesn't exist, which is not true. As a result our service becomes confused, and there is some chaos until we can recover.
To eliminate the possibility of bugs in our service, I reproduced the issue using plain C# code and System.Data.SqlClient
. I'll gladly provide that code upon request. I have tried to reduce the code and the table to the minimum required to reproduce the issue.
A test table is created like this:
CREATE TABLE [TestTable]
(
[RecordId] int IDENTITY(1,1) UNIQUE,
[Locked] int NOT NULL DEFAULT 0,
[Priority] int NULL,
[Status] int NULL,
[SystemName] nvarchar(50) NULL
)
The table has an index created like this:
CREATE NONCLUSTERED INDEX [IDX_GENERAL]
ON [TestTable] ([Status], [SystemName], [Locked])
INCLUDE ([RecordId], [Priority])
The table contains a single record created like this:
INSERT INTO [TestTable] ([Locked], [Priority], [Status], [SystemName])
VALUES (0, 3, 3000, 'System1')
In our service, multiple threads and processes acquire exclusive write access to records by atomically setting their 'Locked' field to one. Then after making any necessary modifications, a thread/process must release the record by resetting its 'Locked' column to zero. What I see is an extremely rare situation when one thread unlocks a record, and simultaneously another thread tries to find it:
One thread performs an UPDATE like this (which succeeds):
UPDATE [TestTable]
SET [Locked] = 0
OUTPUT INSERTED.*
WHERE [Locked] = 1
AND [RecordId] = 1
AND [Status] = 3000
AND [SystemName] = 'System1'
At the same moment, another thread performs a SELECT
like this (which comes back empty):
SELECT [RecordId]
FROM [TestTable]
WHERE [Status] = 3000
AND [SystemName] = 'System1'
ORDER BY Priority DESC, RecordId ASC
I believe the index is part of the problem, because if I remove the Status
or SystemName
keys, then I can no longer reproduce the issue. I have no clue what would cause this behavior. Everything I have read says that this simply cannot happen.
I would welcome any questions, ideas, or suggestions about how to troubleshoot this...