
I need to update a large table, roughly 3B rows. Each record needs 4 fields of decimal(18,2) datatype updated. I have recreated the table with the corrected data and will use it to update the original table. Below is the statement I have been testing. It works, but it takes about 30 minutes to update 6M records, which means updating the original table will take a very long time.

First, when I test this statement, it starts off fairly fast but then slows down significantly as more records are updated. Why is this the case, and what can I change to speed up the process?

Second, I have been working on simplifying the joins and WHERE clauses, but this was a fairly complex table to begin with. Based on the statement below, what else can I do to improve performance and shorten the update time? For example, I know it's standard to batch in rows of 10k, but is it worthwhile to try something more like 100k or even 1M at a time? Or would that cause some other unforeseen issue, like filling up the transaction log or running out of tempdb disk space?

USE Stage;
GO
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
GO
DECLARE @batchId INT;
DECLARE @batchSize INT;
DECLARE @results INT;

SET @results = 1;
SET @batchSize = 10000;
SET @batchId = 0;  -- when 0 rows are returned, exit the loop
DROP TABLE IF EXISTS #tempupdate;

SELECT
    IDENTITY(int, 1, 1)       AS rowkey,
    a.Rm835HeaderId           AS HeaderId,             -- aliased so the temp table's
    a.Rm835ClaimHeaderId      AS ClaimHeaderId,        -- column names match the join
    a.Rm835ClaimPmtId         AS ClaimPmtId,           -- in the UPDATE below
    a.Rm835ClaimServiceLineId AS ClaimServiceLineId,
    a.ClaimLineInd,
    a.ClaimAdjustGroupCode,
    a.RemitDate,
    b.TotalClaimChargeAmt     AS TotalClaimChargeAmt_d,
    b.ClaimPaymentAmt         AS ClaimPaymentAmt_d,
    b.LineItemChargeAmt       AS LineItemChargeAmt_d,
    b.Amount                  AS Amount_d,
    (a.Amount - b.Amount)     AS difff
INTO #tempupdate
FROM [stage].[stage].[DetailRefreshtest] a
INNER JOIN [RealMedstage].[reporting].[DetailRefresh] b
    ON  a.HeaderId = b.HeaderId
    AND a.ClaimHeaderId = b.ClaimHeaderId
    AND a.ClaimPmtId = b.ClaimPmtId
    AND a.ClaimServiceLineId = b.ClaimServiceLineId
    AND a.ClaimLineInd = b.ClaimLineInd
    AND a.ClaimAdjustGroupCode = b.ClaimAdjustGroupCode
WHERE a.RemitDate BETWEEN '2021-04-01 00:00:00.000' AND '2021-04-02 00:00:00.000'
  AND b.RemitDate BETWEEN '2021-04-01 00:00:00.000' AND '2021-04-02 00:00:00.000'
  AND a.Amount IS NOT NULL
  AND ABS(a.Amount - b.Amount) <= 0.5;

WHILE (@results > 0)
BEGIN
    BEGIN TRY
        BEGIN TRAN;

        UPDATE a
        SET [TotalClaimChargeAmt] = b.TotalClaimChargeAmt_d,
            [ClaimPaymentAmt]     = b.ClaimPaymentAmt_d,
            [LineItemChargeAmt]   = b.LineItemChargeAmt_d,
            [Amount]              = b.Amount_d,
            [opsinsertid]         = 0
        FROM #tempupdate b
        INNER JOIN [stage].[stage].[DetailRefreshtest] a
            ON  a.HeaderId = b.HeaderId
            AND a.ClaimHeaderId = b.ClaimHeaderId
            AND a.ClaimPmtId = b.ClaimPmtId
            AND a.ClaimServiceLineId = b.ClaimServiceLineId
            AND a.ClaimLineInd = b.ClaimLineInd
            AND a.ClaimAdjustGroupCode = b.ClaimAdjustGroupCode
        WHERE a.Amount IS NOT NULL
          AND ABS(b.difff) <= 0.5  -- difff = (a.Amount - b.Amount), computed when #tempupdate was built
          AND b.rowkey > @batchId AND b.rowkey <= @batchId + @batchSize;

        SET @results = @@ROWCOUNT;
        -- next batch
        SET @batchId = @batchId + @batchSize;

        COMMIT TRAN;
    END TRY
    BEGIN CATCH
        SELECT
            ERROR_NUMBER()    AS ErrorNumber,
            ERROR_SEVERITY()  AS ErrorSeverity,
            ERROR_STATE()     AS ErrorState,
            ERROR_PROCEDURE() AS ErrorProcedure,
            ERROR_LINE()      AS ErrorLine,
            ERROR_MESSAGE()   AS ErrorMessage;

        IF @@TRANCOUNT > 0
            ROLLBACK TRAN;
        BREAK;  -- stop looping rather than retrying the same failed batch forever
    END CATCH
END
  • Please include the full table definition and its indexes/keys and an example of the [actual execution plan](https://www.brentozar.com/pastetheplan) – Stu Apr 29 '23 at 19:43
  • Your statement looks wrong: you update [stage].[stage].[DetailRefreshtest], but there's no such table in the actual update statement. – siggemannen Apr 29 '23 at 20:26
  • Best would be to create the same PK on the #temp table as in the original table. – siggemannen Apr 29 '23 at 20:27
  • The update statement will be trashing the target table because there's no correlation between `[stage].[stage].[DetailRefreshtest]`, `#tempupdate` and `[RealMedstage].[stage].[RemitAdjustmentDetailRefreshtest]`. – AlwaysLearning Apr 30 '23 at 00:46
  • @siggemannen Sorry, the [RealMedstage].[stage].[RemitAdjustmentDetailRefreshtest] was a mistake on my part. This table is supposed to be [stage].[stage].[DetailRefreshtest]. – user8675309 May 01 '23 at 12:30

1 Answer


Let's say

  • You have two tables: [RealMedstage].[stage].[RemitAdjustmentDetailRefreshtest], which we'll call a, and #tempupdate, which we'll call b
  • Both of these tables are large
  • a has a primary key (or, more importantly, clustered index) called HeaderId. If your primary key is non-clustered, then replace every place I say 'primary key' below with 'clustered index'.
  • b has no primary key but does have an auto-incrementing int called RowKey

Every time you join a and b, SQL Server needs to read the entirety of b because it has no way to know which rows are the relevant ones. To do 1 billion rows @ 10,000 rows per batch, you'll need to run 100,000 batches - and each one does a full read of b.

This is slowing you down. Instead, treat b as you would any other table with regards to indexing and query speed.
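
One quick way to see this for yourself is to run a rowkey-filtered query against #tempupdate with I/O statistics enabled and watch the logical reads:

SET STATISTICS IO ON;

-- Any query that filters on rowkey works for this check.
SELECT COUNT(*)
FROM   #tempupdate b
WHERE  b.rowkey > 0 AND b.rowkey <= 10000;

SET STATISTICS IO OFF;
-- Without an index on rowkey, the Messages output reports logical reads close
-- to the temp table's full page count for every batch; with the clustered
-- index from Step 1 below, the same predicate becomes a cheap seek.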

Keeping the same overall approach (batching, etc.), you can reduce the extra scan time with the following:

  • Determine the top 10,000 (or so) rows in b with a single seek
  • Join b to a on a's primary key/clustered index - and preferably with b already sorted in the same way.

Step 1 - big improvement

Given that the filtering mechanism you're using is RowKey, you want your data sorted (indexed) on RowKey so the relevant 10,000 rows are easy to identify.

You can either create the temp table first with a primary key/clustered index, or you can add the index afterwards. To add it afterwards, use something like the following.

CREATE CLUSTERED INDEX CX_tempupdate ON #tempupdate (rowkey);
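
If you'd rather create the temp table first, a minimal sketch could look like the following. The decimal(18,2) types come from the question, but the key and code column types are assumptions, since the real table definition wasn't posted.

CREATE TABLE #tempupdate
(
    rowkey                INT IDENTITY(1, 1) NOT NULL PRIMARY KEY CLUSTERED,
    HeaderId              BIGINT         NOT NULL,  -- key/code column types are assumptions
    ClaimHeaderId         BIGINT         NOT NULL,
    ClaimPmtId            BIGINT         NOT NULL,
    ClaimServiceLineId    BIGINT         NOT NULL,
    ClaimLineInd          CHAR(1)        NOT NULL,
    ClaimAdjustGroupCode  VARCHAR(10)    NOT NULL,
    RemitDate             DATETIME       NOT NULL,
    TotalClaimChargeAmt_d DECIMAL(18, 2) NULL,
    ClaimPaymentAmt_d     DECIMAL(18, 2) NULL,
    LineItemChargeAmt_d   DECIMAL(18, 2) NULL,
    Amount_d              DECIMAL(18, 2) NULL,
    difff                 DECIMAL(18, 2) NULL
);

INSERT INTO #tempupdate
    (HeaderId, ClaimHeaderId, ClaimPmtId, ClaimServiceLineId, ClaimLineInd,
     ClaimAdjustGroupCode, RemitDate, TotalClaimChargeAmt_d, ClaimPaymentAmt_d,
     LineItemChargeAmt_d, Amount_d, difff)
SELECT a.Rm835HeaderId, a.Rm835ClaimHeaderId, a.Rm835ClaimPmtId,
       a.Rm835ClaimServiceLineId, a.ClaimLineInd, a.ClaimAdjustGroupCode,
       a.RemitDate, b.TotalClaimChargeAmt, b.ClaimPaymentAmt,
       b.LineItemChargeAmt, b.Amount, (a.Amount - b.Amount)
FROM [stage].[stage].[DetailRefreshtest] a
INNER JOIN [RealMedstage].[reporting].[DetailRefresh] b
    ON  a.HeaderId = b.HeaderId
    AND a.ClaimHeaderId = b.ClaimHeaderId
    AND a.ClaimPmtId = b.ClaimPmtId
    AND a.ClaimServiceLineId = b.ClaimServiceLineId
    AND a.ClaimLineInd = b.ClaimLineInd
    AND a.ClaimAdjustGroupCode = b.ClaimAdjustGroupCode
WHERE a.RemitDate BETWEEN '2021-04-01 00:00:00.000' AND '2021-04-02 00:00:00.000'
  AND b.RemitDate BETWEEN '2021-04-01 00:00:00.000' AND '2021-04-02 00:00:00.000'
  AND a.Amount IS NOT NULL
  AND ABS(a.Amount - b.Amount) <= 0.5
ORDER BY a.HeaderId;  -- with INSERT ... SELECT, ORDER BY makes the IDENTITY values
                      -- follow this order, which also achieves Step 2 below at load time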

Step 2 - more improvement

It would be somewhat faster if you could ensure that RowKey is in the same order as the clustered index of a (e.g., HeaderId). This minimises the cost of the writes to a: instead of jumping around everywhere to write the data, the writes are grouped together on the data pages.

You could ensure the sort by removing the IDENTITY on RowKey, making it NULL to start with (e.g., CAST(NULL AS int) AS RowKey), and then updating it with ROW_NUMBER() ordered by the clustered index of a, e.g.,

WITH SortedTempTable AS
(
    SELECT RowKey, ROW_NUMBER() OVER (ORDER BY HeaderId) AS rn
    FROM   #tempupdate
)
UPDATE SortedTempTable
SET    RowKey = rn;

Note that it makes a lot of sense to do this before creating the clustered index above.
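
Putting it all together, the batch loop from the question would then look something like this sketch (same join and SET lists as before; the TRY/CATCH and transaction handling are omitted for brevity, and the amount filters are gone because they were already applied when #tempupdate was built):

DECLARE @batchId INT = 0, @batchSize INT = 10000, @results INT = 1;

WHILE (@results > 0)
BEGIN
    UPDATE a
    SET [TotalClaimChargeAmt] = b.TotalClaimChargeAmt_d,
        [ClaimPaymentAmt]     = b.ClaimPaymentAmt_d,
        [LineItemChargeAmt]   = b.LineItemChargeAmt_d,
        [Amount]              = b.Amount_d,
        [opsinsertid]         = 0
    FROM #tempupdate b
    INNER JOIN [stage].[stage].[DetailRefreshtest] a
        ON  a.HeaderId = b.HeaderId
        AND a.ClaimHeaderId = b.ClaimHeaderId
        AND a.ClaimPmtId = b.ClaimPmtId
        AND a.ClaimServiceLineId = b.ClaimServiceLineId
        AND a.ClaimLineInd = b.ClaimLineInd
        AND a.ClaimAdjustGroupCode = b.ClaimAdjustGroupCode
    WHERE b.rowkey > @batchId AND b.rowkey <= @batchId + @batchSize;  -- clustered index seek

    SET @results = @@ROWCOUNT;
    SET @batchId = @batchId + @batchSize;
END

Each iteration now reads only ~@batchSize rows from b, and because rowkey follows a's clustered index order after Step 2, the writes to a land on adjacent pages instead of scattering across the table.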

seanb
  • Thanks for your response. You have helped clear a few things up, but one issue is that the A table (~3B rows) doesn't have any keys. All it has is a clustered columnstore index. I can certainly make the changes so that the B table has a PK based on RowKey, but the A table doesn't have a column that can serve as a PK since none of those columns are unique, unless of course I am misunderstanding something. – user8675309 May 01 '23 at 12:47