
How should a crash (for any reason) during a Delta table MERGE into the target Delta table be handled? Will re-running the partially failed (crashed) MERGE command against the same target Delta table create duplicate records, given that the source Delta table is updated (some records inserted or updated) before each run?

Assume the following scenario (note: each record contains the primary key on which the merge is performed):

In Run-1, the source S1 has 10 records (5 updates, 5 inserts) and the target has 100 records. After the merge, the target Delta table contains 105 records.

In Run-2, the source S2 has 5 records (2 updates, 3 inserts) and the target Delta table has 105 records. The merge fails after writing partially (2 records inserted and 1 record updated).

Because Run-2 failed, we re-run the job. But this time S2 (the source Delta table was updated in the meantime) contains 7 records (3 updates, 4 inserts): the previous 5 records (2 updates, 3 inserts) plus 2 new records (1 update, 1 insert).

Note: In the re-run of Run-2, all 7 records are unique.

So if we perform the merge on the target Delta table again, will it re-insert the already-inserted records and create duplicates?
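
For reference, the merge is roughly of this shape (a minimal sketch; the table paths and the key column `id` are placeholders, not the real job):

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Assumes a SparkSession already configured with the Delta Lake extensions
spark = SparkSession.builder.getOrCreate()

target = DeltaTable.forPath(spark, "/tables/target")        # 100 records before Run-1
source = spark.read.format("delta").load("/tables/source")  # S1 / S2, changes each run

(target.alias("t")
    .merge(source.alias("s"), "t.id = s.id")   # "id" is the primary key
    .whenMatchedUpdateAll()                    # the "update" records
    .whenNotMatchedInsertAll()                 # the "insert" records
    .execute())
```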

Note: Currently, txnVersion & txnAppId are supported for overwrite, append, and streaming merge.
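
For context, those options are set on the writer roughly like this (a sketch; the app id, version number, path, and DataFrame are hypothetical):

```python
# Idempotent write: Delta skips the write if this (txnAppId, txnVersion)
# pair was already committed to the target table.
df = spark.createDataFrame([(1, "a")], ["id", "value"])  # hypothetical batch

(df.write.format("delta")
    .mode("append")
    .option("txnAppId", "my-pipeline")   # stable id of the writing application
    .option("txnVersion", 42)            # monotonically increasing run/version number
    .save("/tables/target"))
```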

Versions:

delta version: 2.3.0
spark version: 3.3.2
hadoop version: 3.3

Any help on the above issue would be appreciated.

Kunfu Panda

1 Answer


I assume all 5 records in the first Run-2 are processed in one batch and one transaction. Since a Delta Lake table provides ACID guarantees, when the 5 records are only partially written and the job fails, the whole transaction is not committed and none of the 5 records are applied to the table. You can check the write protocol in the Delta Lake paper, section 3.2.2 Write Transactions.
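
One way to see this for yourself is to look at the table history after the failed run; the aborted MERGE simply never appears as a new version. A sketch (the path is hypothetical):

```python
from delta.tables import DeltaTable

# After the failed Run-2, the latest version should still be the Run-1 MERGE;
# the aborted transaction is never recorded in the delta log.
(DeltaTable.forPath(spark, "/tables/target")
    .history()
    .select("version", "timestamp", "operation", "operationMetrics")
    .show(truncate=False))
```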

However, when you rerun Run-2, there are 7 records in total, which include duplicated records:

  1. Insert: there are 2 duplicated records with the same primary key. Since the first Run-2 was not committed, both records would be inserted, creating duplicates.
  2. Update: there are 2 duplicated records with the same primary key. Since two source records with the same key try to update the same record in the target table, the merge fails with a conflict error, even if the two records are exactly the same.

Since the update part causes an error, even when you rerun Run-2, the transaction is not committed and none of the 7 records are applied.
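
If the rerun source really did contain two rows with the same primary key, the usual workaround is to deduplicate the source before merging, for example by keeping only the latest row per key. A sketch (the key column id and the timestamp column updated_at are assumptions, not part of the original question):

```python
from pyspark.sql import Window, functions as F

source = spark.read.format("delta").load("/tables/source")  # hypothetical source

# Keep only the most recent row per primary key so that MERGE never sees
# two source rows matching the same target row.
w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
deduped_source = (source
    .withColumn("_rn", F.row_number().over(w))
    .filter("_rn = 1")
    .drop("_rn"))
```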

--- Edit 1 - 2023-07-04 - Answers to the questions in the comments:

  1. Whenever you insert or update records in a transaction, new Parquet files are created, because Delta Lake applies copy-on-write (CoW) for upserts, and the delta log controls which Parquet files are read when you query the table. Files written by an aborted transaction are never referenced by the log, so they do not affect reads. You can check section 3.2.1 Reading from Tables.

  2. Although there is no difference when you read the table, idempotent writes provide other advantages. One is that you can speed up the pipeline when you have to rerun a job. Say a job consists of 100 transactions and fails at the 95th. When you rerun the job, because idempotent writes are applied, only the remaining transactions are processed. It can also speed up table compaction, since fewer small Parquet files need to be compacted.

  3. Are idempotent writes only for append & overwrite mode? I'm not sure about this part.

  4. I think duplication can happen when you don't set a checkpoint for your stream. Without a checkpoint, the same batch of data will be processed again if you rerun the job; I think that is what the note means (see the sketch after this list).
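
Regarding point 4, the streaming pattern referred to in that note is a MERGE inside foreachBatch with a checkpoint; the checkpoint is what lets Spark skip batches that were already committed when the job is rerun. A sketch (the paths, key column id, and function name are made up):

```python
from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # The MERGE itself should be idempotent per batch, since foreachBatch
    # can re-execute a batch when the job is restarted.
    (DeltaTable.forPath(spark, "/tables/target").alias("t")
        .merge(batch_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream.format("delta").load("/tables/source")
    .writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/checkpoints/target_upsert")  # tracks committed batches
    .start())
```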

Jonathan Lam
  • In the re-run of Run-2, it does not contain duplicate records (all 7 records are unique). – Kunfu Panda Jul 03 '23 at 07:46
  • @KunfuPanda If all 7 records are unique, then there should be no error and no duplication, since the first Run-2 upsert operation was not committed. – Jonathan Lam Jul 03 '23 at 10:22
  • Yes, I have tested the scenario locally and it works fine, but I can see some extra Parquet files written to the location. 1. Does it affect anything? And if that is the case, why do we need idempotent writes (https://docs.delta.io/latest/delta-batch.html#idempotent-writes)? 2. Are idempotent writes only for append & overwrite mode? 3. As per the docs, duplication can happen in merge; please check the note section at the bottom, which mentions the streaming case (https://docs.delta.io/latest/delta-update.html). – Kunfu Panda Jul 03 '23 at 12:04
  • @KunfuPanda I've updated the answer above. You can see if it helps. – Jonathan Lam Jul 04 '23 at 01:55