I have a source table with say following data
+----------------+---+--------+-----------------+---------+
|registrationDate| id|custName| email|eventName|
+----------------+---+--------+-----------------+---------+
| 17-02-2023| 2| Person2|person2@gmail.com| INSERT|
| 17-02-2023| 1| Person1|person1@gmail.com| INSERT|
| 17-02-2023| 5| Person5|person5@gmail.com| INSERT|
| 17-02-2023| 4| Person4|person4@gmail.com| INSERT|
| 17-02-2023| 3| Person3|person3@gmail.com| INSERT|
+----------------+---+--------+-----------------+---------+
the above table is being stored into my S3 in delta format. Now I'm creating a dataframe from Kinesis streams , and trying to merge it into my delta table. Every operation works fine - upserts, deletes everything but let's say the dataframe generated from the stream looks something like below -
+---------+---+---------------+----------------+----------------+
|eventName|id |custName |email |registrationDate|
+---------+---+---------------+----------------+----------------+
|REMOVE |1 |null |null |null |
|MODIFY |2 |ModPerson2 |modemail@mod.com|09-02-2023 |
|MODIFY |3 |3modifiedperson|modp@mod.com |09-02-2023 |
|INSERT |100|Person100 |p100@p.com |09-02-2023 |
|INSERT |200|Person200 |p200@p.com |09-02-2023 |
|REMOVE |5 |null |null |null |
|REMOVE |200|null |null |null |
+---------+---+---------------+----------------+----------------+
it is evident from the dataframe above created by data streams that i'm inserting a record with ID 200 while also deleting that record with the same id in the same batch.The records with ID 1 and 5 are deleted but not 200. Merging is not possible in this case of duplication to the delta table. How do i counter this ?
deltaTable = DeltaTable.forPath(spark, 's3://path-to-my-s3')
(deltaTable.alias("first_df").merge(
updated_data.alias("append_df"),
"first_df.id = append_df.id")
.whenMatchedDelete("append_df.eventName='REMOVE'")
.whenMatchedUpdateAll("first_df.id = append_df.id")
.whenNotMatchedInsertAll()
.execute()
)
Resulting Delta table after Merge - Id with 200 still remains
+----------------+---+---------------+-----------------+---------+
|registrationDate| id| custName| email|eventName|
+----------------+---+---------------+-----------------+---------+
| 09-02-2023| 2| ModPerson2| modemail@mod.com| MODIFY|
| 17-02-2023| 4| Person4|person4@gmail.com| INSERT|
| 09-02-2023|100| Person100| p100@p.com| INSERT|
| 09-02-2023|200| Person200| p200@p.com| INSERT|
| null|200| null| null| REMOVE|
| 09-02-2023| 3|3modifiedperson| modp@mod.com| MODIFY|
+----------------+---+---------------+-----------------+---------+