
I came across a challenging scenario in which Microsoft SQL Server deadlocks the transaction while executing a batch of delete queries.

I see this error:

Transaction (Process ID 293) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.; nested exception is com.microsoft.sqlserver.jdbc.SQLServerException: Transaction (Process ID 293) was deadlocked on lock resources with another process and has been chosen as the deadlock victim.

My batch update query:

 jdbcTemplate.batchUpdate(purgeQueryArray);

I have 7-8 tables whose data needs to be purged when older than 7 days. In the lower environments the data volume is low, so it works fine. In production we have as much as 300k to 500k rows in each table. While deleting that many records, the ShedLock task that executes the Spring JDBC query ends up in a deadlock. An API that performs the same operation works fine, but it runs at a different time, so I'm not sure about the load at the time the scheduled task runs.

@Scheduled(cron = "${scheduler.expression}", zone = "GMT")
@SchedulerLock(name = "SCHEDULER_LOCK", lockAtLeastFor = "10S", lockAtMostFor = "5M")
public void purge() {
    // prepare the array of delete queries, purgeQueryArray
    jdbcTemplate.batchUpdate(purgeQueryArray);
}

ShedLock table data:

name = SCHEDULER_LOCK, lock_until = 2020-10-21 00:00:15, locked_at = 2020-10-21 00:00:00, locked_by = tomcat-406116080-2-521278230

Though I have set lockAtMostFor = 5M, lock_until is only 15 seconds past locked_at, which is strange. Can that be the reason? The operation takes 1-2 minutes for the production volume of data.

Any suggestion will be very much appreciated.

EDIT:

DELETE FROM BATCH_STEP_EXECUTION_CONTEXT WHERE STEP_EXECUTION_ID IN (SELECT BE.STEP_EXECUTION_ID FROM BATCH_STEP_EXECUTION BE 
join BATCH_STEP_EXECUTION_CONTEXT BEC on BE.STEP_EXECUTION_ID = BEC.STEP_EXECUTION_ID
where CAST(LAST_UPDATED as date) < DATEADD(day, -7, GETDATE()));

DELETE FROM BATCH_STEP_EXECUTION WHERE JOB_EXECUTION_ID IN (SELECT JOB_EXECUTION_ID FROM BATCH_JOB_EXECUTION
where CAST(LAST_UPDATED as date) < DATEADD(day, -7, GETDATE()));

DELETE FROM BATCH_JOB_EXECUTION_CONTEXT WHERE JOB_EXECUTION_ID IN (SELECT JOB_EXECUTION_ID FROM  BATCH_JOB_EXECUTION
where CAST(LAST_UPDATED as date) < DATEADD(day, -7, GETDATE()));

DELETE FROM BATCH_JOB_EXECUTION_PARAMS WHERE JOB_EXECUTION_ID IN (SELECT JOB_EXECUTION_ID FROM BATCH_JOB_EXECUTION
where CAST(LAST_UPDATED as date) < DATEADD(day, -7, GETDATE()));

DELETE FROM BATCH_JOB_EXECUTION WHERE CAST(LAST_UPDATED as date) < DATEADD(day, -7, GETDATE());

DELETE FROM BATCH_JOB_INSTANCE WHERE JOB_INSTANCE_ID NOT IN (SELECT JOB_INSTANCE_ID FROM BATCH_JOB_EXECUTION);

Thanks in advance

Saawan
  • The deadlocks are caused by the *queries*, not SQL Server. We can't help without the queries. SQL Server [stores a lot of information about deadlocks](https://www.brentozar.com/archive/2014/06/capturing-deadlock-information/) including which processes, resources and queries were involved. In Extended Events, check the `system_health` session and look for `xml_deadlock_report` events – Panagiotis Kanavos Oct 21 '20 at 19:01
  • BTW `batch update` sounds like something that's going to take a lot of exclusive locks and quite possibly block other connections that try to read or modify records on the same table. If badly written, eg if the script starts by reading everything, modifying on the client then sending it back, you'll end up with Shared locks on all rows, blocking any other connection that tries to modify rows. If two jobs start reading/updating, they can easily end up waiting for each other to finish updating. Using eg HOLDLOCK will make things a lot worse. – Panagiotis Kanavos Oct 21 '20 at 19:02
  • `as much as 300k to 500k in each table.` that's not a lot of data. – Panagiotis Kanavos Oct 21 '20 at 19:06
  • Thanks Panagiotis for the quick response. I have 8 delete queries and I don't want to hit the DB for every delete query, and the number will increase further in the future, so I use batchUpdate. What do you think would be the right choice here? – Saawan Oct 21 '20 at 19:09
  • BATCH_STEP_EXECUTION_CONTEXT, BATCH_STEP_EXECUTION, BATCH_JOB_EXECUTION_CONTEXT, BATCH_JOB_EXECUTION_PARAMS, BATCH_JOB_EXECUTION, BATCH_JOB_INSTANCE – I delete from these tables based on LAST_UPDATED, so they are dependent tables. batchUpdate ensures they are deleted sequentially together – Saawan Oct 21 '20 at 19:12
  • Post the queries. Read the Extended Events. It's *impossible* to say what's wrong as there's no usable information in the question. Only indirect hints that suggest something's wrong – Panagiotis Kanavos Oct 21 '20 at 19:13
  • We have isolation level as ISOLATION_REPEATABLE_READ for JobRepositoryFactoryBean – Saawan Oct 21 '20 at 19:14
  • `I delete from these tables based on LAST_UPDATED` if that field isn't indexed, the server has to scan *all rows* in the table to find matches. Which means it has to lock all rows, to ensure the result set doesn't change – Panagiotis Kanavos Oct 21 '20 at 19:14
  • You still haven't posted the query or the deadlock graph. REPEATABLE READ means that using an unindexed column will end up locking every row to ensure that query doesn't change. – Panagiotis Kanavos Oct 21 '20 at 19:15
  • LAST_UPDATED is indexed.. – Saawan Oct 21 '20 at 19:17
  • BTW you could delete old records instantaneously if you used partitioning based on time, and [truncate the oldest partition](https://www.mssqltips.com/sqlservertip/4436/sql-server-2016-truncate-table-with-partitions/) – Panagiotis Kanavos Oct 21 '20 at 19:19
  • `where CAST(LAST_UPDATED as date) < DATEADD(day, -7, GETDATE())` results in bad statistics estimates. The server can convert that cast into a range query but *can't* guess how many indexed values match that condition. Since your queries are *joins*, not simple DELETEs, this will affect the JOIN strategy. Cast the *right-hand* expression instead - `where last_updated < cast(dateadd(day, -6, getdate()) as date)`. By using `-6` instead of `-7` you include all date+times in the last day – Panagiotis Kanavos Oct 22 '20 at 05:54
  • You still haven't posted the deadlock graph. So while you can improve that query, you don't know what other queries need fixing. BTW your queries are equivalent to JOINS and could be simplified. The DELETE operation will take locks on *all* the tables involved, which means you get exclusive locks on multiple tables from the very first DELETE – Panagiotis Kanavos Oct 22 '20 at 05:56
  • You can use the [Snapshot isolation level](https://www.sqlshack.com/snapshot-isolation-in-sql-server/) either at the database level or through `SET TRANSACTION ISOLATION LEVEL` so readers and writers don't block each other. `REPEATABLE READ` is a stricter level than the common `READ COMMITTED` level, so that plays an important role too. – Panagiotis Kanavos Oct 22 '20 at 06:01
  • Thanks a lot.. I will try these things out and let you know – Saawan Oct 22 '20 at 06:26

1 Answer


Is LAST_UPDATED from BATCH_STEP_EXECUTION? (I like to put an alias on every column because it's hard to read otherwise.)

A deadlock will be more likely if it locks more or locks longer. It's important to optimize the query. And if that's not possible, then attempt to use small batch sizes to minimize the size of the transaction.

The first thing I would do is copy the data to a test environment. The first test I would try is to set the date back far enough to exclude any records. If it's slow, then it's doing a scan. A small batch size is not going to help - it might make it much worse.

The WHERE clauses with CAST(LAST_UPDATED as date) < DATEADD(day, -7, GETDATE()) covers the column, and a table scan can result even if there is an index. Can you compare LAST_UPDATED directly to a local variable of exactly the same type?
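A minimal sketch of that idea, using one of the question's tables (the variable name and the DATETIME type are assumptions; match the actual column type):

```sql
-- Compute the cutoff once, in the same type as LAST_UPDATED,
-- so the column stays bare in the predicate and an index seek is possible.
DECLARE @LAST_UPDATED_LIMIT DATETIME = DATEADD(day, -7, GETDATE());

DELETE FROM BATCH_JOB_EXECUTION
WHERE LAST_UPDATED < @LAST_UPDATED_LIMIT;
```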

Perhaps check the execution plan. It might indicate a problem.

Another option is to capture the IDs first, if they are stable enough, perhaps into a temp table. Then do the delete in a separate transaction driven by that table, looping through the delete in small batches.
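A sketch of that batching idea in T-SQL (the batch size of 5000 and the temp-table name are arbitrary choices, not from the question):

```sql
-- Capture the doomed IDs once, then delete in small chunks so each
-- statement locks fewer rows and finishes quickly.
SELECT STEP_EXECUTION_ID
INTO #doomed
FROM BATCH_STEP_EXECUTION
WHERE LAST_UPDATED < DATEADD(day, -7, GETDATE());

WHILE 1 = 1
BEGIN
    DELETE TOP (5000) BEC
    FROM BATCH_STEP_EXECUTION_CONTEXT BEC
    INNER JOIN #doomed D
        ON D.STEP_EXECUTION_ID = BEC.STEP_EXECUTION_ID;

    IF @@ROWCOUNT = 0 BREAK;  -- nothing left to delete
END
```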

If the ID is an identity column or otherwise monotonically increasing, get the oldest ID to keep. That should be fast. Then delete all rows with a smaller ID. (Be sure this is valid logic.)
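For example (this assumes JOB_EXECUTION_ID increases in step with LAST_UPDATED, which must be verified for your data):

```sql
DECLARE @keepFrom BIGINT;

-- Lowest ID among the rows that are still young enough to keep.
SELECT @keepFrom = MIN(JOB_EXECUTION_ID)
FROM BATCH_JOB_EXECUTION
WHERE LAST_UPDATED >= DATEADD(day, -7, GETDATE());

-- Everything below that ID is older than 7 days.
DELETE FROM BATCH_JOB_EXECUTION
WHERE JOB_EXECUTION_ID < @keepFrom;
```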

Maybe avoid the IN clause? Is this the same result?

DECLARE @LAST_UPDATED_LIMIT DATETIME = DATEADD(day, -7, GETDATE());

DELETE BEC
FROM BATCH_STEP_EXECUTION_CONTEXT BEC
INNER JOIN BATCH_STEP_EXECUTION BE
    ON BE.STEP_EXECUTION_ID = BEC.STEP_EXECUTION_ID
WHERE BE.LAST_UPDATED < @LAST_UPDATED_LIMIT -- uncover the column if possible

Fun stuff. Good luck.

Randy in Marin
  • I should have mentioned that measuring the logical reads is important. It's probably very large. (My first comment. Thought I had to have 50 points.) – Randy in Marin Oct 21 '20 at 23:54
  • `a table scan will result even if there is an index.` no, the server converts `cast(last_updated as date)` to a range search and uses indexes. It can't calculate statistics though, so it's highly likely the JOIN strategies won't be optimal – Panagiotis Kanavos Oct 22 '20 at 06:02
  • I used the word "covers". I should have perhaps said this makes the query non-SARGable instead to avoid confusion with a covering index. But if it uses a ranged search and indexes, is it SARGable? – Randy in Marin Oct 28 '20 at 19:19