
We are using the full recovery model in SQL Server. We have a job that merges rows from a staging table into a final table. Both tables hold millions of rows, and we merge in batches of 10,000 rows.

The pseudo code for a single batch is given below:

BEGIN TRANSACTION

DELETE TOP (10000)
FROM <Staging Table>
OUTPUT deleted.* INTO @TableVariable

MERGE INTO <Final Table> 
USING @TableVariable

COMMIT TRANSACTION

The problem is that each batch gets slower than the previous one. When we restart the server, the batches become fast again. The transactions also take a very long time to be written to disk. We suspect a problem with the transaction log. When we reduce the batch size, more transactions are issued and the batches slow down even more.

Is there a way to improve the performance of this kind of batched delete & merge operation? Do you recommend forcing a CHECKPOINT under the full recovery model?
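
For context, here is a minimal sketch of the checks that can show whether the log is the bottleneck (the database name and backup path are placeholders, not from our actual environment):

-- Log file size and percentage used, per database
DBCC SQLPERF(LOGSPACE);

-- Why the log cannot currently be reused (e.g. LOG_BACKUP, ACTIVE_TRANSACTION)
SELECT name, log_reuse_wait_desc
FROM sys.databases
WHERE name = 'MyDatabase';

-- In the full recovery model, log space is reclaimed after a log backup,
-- not by CHECKPOINT alone
BACKUP LOG MyDatabase TO DISK = N'D:\Backups\MyDatabase_log.trn';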

  • Just a thought; joining a "huge" table to a table variable might not be the best route, since it's always going to treat the table variable like it contains 1 row. – Xedni Aug 11 '17 at 18:31
  • I get it. When we do batches, we load 10,000 rows into the table variable and use it for the merge. The first batch is fast, but the batches get slower as they progress. My question is how to avoid this slowdown in later batches. – Venkataraman R Aug 11 '17 at 18:40

2 Answers


Merge operations can often be improved by avoiding superfluous updates. If there is nothing to update because the target and source rows are equal, do not update the row. This is very effective in cases where most rows have not changed, because SQL Server writes much less information to the transaction log.

To avoid superfluous updates in a merge operation, write the merge statement like this:

MERGE INTO target AS t
USING source AS s
ON t.id = s.id
WHEN MATCHED 
  AND ((t.col1 <> s.col1 
       OR t.col1 IS NULL AND s.col1 IS NOT NULL
       OR t.col1 IS NOT NULL AND s.col1 IS NULL)
  OR (t.col2 <> s.col2 
       OR t.col2 IS NULL AND s.col2 IS NOT NULL
       OR t.col2 IS NOT NULL AND s.col2 IS NULL)
  OR (t.col3 <> s.col3 
       OR t.col3 IS NULL AND s.col3 IS NOT NULL
       OR t.col3 IS NOT NULL AND s.col3 IS NULL))
THEN UPDATE SET
  col1 = s.col1, col2 = s.col2, col3 = s.col3
WHEN NOT MATCHED BY TARGET THEN 
    INSERT (id, col1, col2, col3)
    VALUES (s.id, s.col1, s.col2, s.col3);
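
With many columns the explicit NULL checks get verbose. A more compact variant of the same idea (not in the original answer, shown here only as a sketch) uses EXISTS with EXCEPT, which compares the column lists with NULL-safe semantics:

MERGE INTO target AS t
USING source AS s
ON t.id = s.id
WHEN MATCHED
  AND EXISTS (SELECT t.col1, t.col2, t.col3   -- rows differ (including NULL vs non-NULL)
              EXCEPT
              SELECT s.col1, s.col2, s.col3)
THEN UPDATE SET
  col1 = s.col1, col2 = s.col2, col3 = s.col3
WHEN NOT MATCHED BY TARGET THEN
    INSERT (id, col1, col2, col3)
    VALUES (s.id, s.col1, s.col2, s.col3);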
  • I think you should have an OR condition here, since any one of the columns might have changed. In our case we had 100+ columns, so we went for a simple WHEN MATCHED merge. Superfluous updates can be avoided with a small set of columns, but with many columns I think the transaction could take longer. – Venkataraman R Nov 27 '19 at 03:43
  • You are right. Changed to OR. Obviously, with a lot of columns more CPU is needed, but you won't know if it is worth it until you try and measure. – Jesús López Nov 27 '19 at 05:53
  • Agreed. On this project we went ahead with the delay approach and it solved our problem. As I said, it was a project from long back, and yesterday I realized this question was missing an answer, so I posted the approach we tried. I have upvoted your answer, as it is a good approach; unfortunately we did not think of it at the time and could not try it. – Venkataraman R Nov 27 '19 at 06:09

What we did, instead of forcing a CHECKPOINT, was to introduce an artificial delay in the WHILE loop so that the transactions are not throttled.

This let us overcome the out-of-memory issues caused by transaction throttling on the SQL Server instance. We had millions of rows in the staging table, and the 10,000-row batches combined with the delay made sure we were not overloading the server, which other users were accessing at the same time.


DECLARE @RowCount INT;

SET @RowCount = (SELECT COUNT(*) FROM StagingTable);

WHILE (@RowCount > 0)
BEGIN

    BEGIN TRANSACTION

    DELETE TOP (10000)
    FROM <Staging Table>
    OUTPUT deleted.* INTO @TableVariable

    MERGE INTO <Final Table> 
    USING @TableVariable

    COMMIT TRANSACTION

    WAITFOR DELAY '00:00:10'; -- artificially introduce a 10-second delay

    SET @RowCount = (SELECT COUNT(*) FROM StagingTable);

END 
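
For completeness, here is a fuller sketch of the same loop with the merge spelled out. The names dbo.StagingTable, dbo.FinalTable and the columns id, col1 and col2 are placeholders for illustration only; adjust them to the real schema:

-- the table variable must match the staging table's schema (placeholder columns here)
DECLARE @Batch TABLE (id INT PRIMARY KEY, col1 INT, col2 NVARCHAR(100));
DECLARE @RowCount INT = (SELECT COUNT(*) FROM dbo.StagingTable);

WHILE (@RowCount > 0)
BEGIN
    DELETE FROM @Batch;  -- clear the previous batch; the variable persists across iterations

    BEGIN TRANSACTION;

    -- move one batch out of the staging table
    DELETE TOP (10000)
    FROM dbo.StagingTable
    OUTPUT deleted.id, deleted.col1, deleted.col2 INTO @Batch (id, col1, col2);

    -- upsert the batch into the final table
    MERGE INTO dbo.FinalTable AS t
    USING @Batch AS s
    ON t.id = s.id
    WHEN MATCHED THEN
        UPDATE SET col1 = s.col1, col2 = s.col2
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (id, col1, col2) VALUES (s.id, s.col1, s.col2);

    COMMIT TRANSACTION;

    WAITFOR DELAY '00:00:10';  -- give concurrent users and the log some breathing room

    SET @RowCount = (SELECT COUNT(*) FROM dbo.StagingTable);
END;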

  • This is a project from long back. The question was missing an answer, so I have added one now. We had a scenario where we needed to remove data from the staging table and MERGE it into the final table, so we went for a DELETE followed by a MERGE. – Venkataraman R Nov 26 '19 at 06:48
  • Yes. I mean that by avoiding superfluous updates you can improve the merge operation, because less information is written to the transaction log. Use `WHEN MATCHED AND source.col1 <> target.col1 AND source.col2 <> target.col2 ....`. Be aware of nullable columns; you can use `(s.col1 <> t.col1 OR s.col1 IS NULL AND t.col1 IS NOT NULL OR s.col1 IS NOT NULL AND t.col1 IS NULL)` – Jesús López Nov 26 '19 at 17:53