How should a crash (for any reason) during a Delta table merge into the target Delta table be handled? Will it create duplicate records if I re-run a partially failed (crashed) merge command against the same target Delta table, given that the source Delta table (in which some records have been updated or inserted) gets updated each time?
Assume the following scenario. (Note: each record contains a primary key on which the merge is performed; a minimal sketch of such a merge is shown after the scenario.)
In Run-1, the source S1 has 10 records (5 updates, 5 inserts) and the target has 100 records; after the merge, the target Delta table contains 105 records.
In Run-2, the source S2 has 5 records (2 updates, 3 inserts) and the target Delta table has 105 records, but the merge fails after writing partially (2 records have been inserted and 1 record has been updated).
Because Run-2 failed, we re-run the job. By this time S2 (the source Delta table, which gets updated between runs) contains 7 records (3 updates, 4 inserts): the previous 5 records (2 updates, 3 inserts) plus 2 new records (1 update, 1 insert).
Note: in the re-run of Run-2, all 7 records are unique.
So if we perform the merge again on the target Delta table, will it insert the already-inserted records and create duplicates?
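For reference, here is a minimal sketch of the merge being re-run (PySpark with delta-spark). The table paths and the `id` primary-key column are placeholders for illustration, not our actual job:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Hypothetical paths and key column, for illustration only.
target = DeltaTable.forPath(spark, "/tables/target")
source = spark.read.format("delta").load("/tables/source")

# Upsert on the primary key: matched rows are updated,
# unmatched rows are inserted.
(target.alias("t")
    .merge(source.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```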
Note: currently, txnVersion & txnAppId are supported for overwrite, append, and streaming merge.
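For context, this is how those options are used for an idempotent batch write per the Delta documentation; the app ID and version number below are placeholders, and the issue is that there appears to be no equivalent for a plain batch merge:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder change set; in practice this is the computed batch to write.
df = spark.read.format("delta").load("/tables/source")

# Idempotent write: if this (txnAppId, txnVersion) pair has already been
# committed to the target table, Delta skips the write instead of
# duplicating the data on a retry.
(df.write.format("delta")
    .option("txnAppId", "my-etl-job")  # hypothetical, stable application id
    .option("txnVersion", 42)          # monotonically increasing run number
    .mode("append")
    .save("/tables/target"))
```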
Versions:
delta version: 2.3.0
spark version: 3.3.2
hadoop version: 3.3
Any help on the above issue?