0

I have a source table with say following data

+----------------+---+--------+-----------------+---------+
|registrationDate| id|custName|            email|eventName|
+----------------+---+--------+-----------------+---------+
|      17-02-2023|  2| Person2|person2@gmail.com|   INSERT|
|      17-02-2023|  1| Person1|person1@gmail.com|   INSERT|
|      17-02-2023|  5| Person5|person5@gmail.com|   INSERT|
|      17-02-2023|  4| Person4|person4@gmail.com|   INSERT|
|      17-02-2023|  3| Person3|person3@gmail.com|   INSERT|
+----------------+---+--------+-----------------+---------+

the above table is being stored into my S3 in delta format. Now I'm creating a dataframe from Kinesis streams , and trying to merge it into my delta table. Every operation works fine - upserts, deletes everything but let's say the dataframe generated from the stream looks something like below -

+---------+---+---------------+----------------+----------------+
|eventName|id |custName       |email           |registrationDate|
+---------+---+---------------+----------------+----------------+
|REMOVE   |1  |null           |null            |null            |
|MODIFY   |2  |ModPerson2     |modemail@mod.com|09-02-2023      |
|MODIFY   |3  |3modifiedperson|modp@mod.com    |09-02-2023      |
|INSERT   |100|Person100      |p100@p.com      |09-02-2023      |
|INSERT   |200|Person200      |p200@p.com      |09-02-2023      |
|REMOVE   |5  |null           |null            |null            |
|REMOVE   |200|null           |null            |null            |
+---------+---+---------------+----------------+----------------+

it is evident from the dataframe above created by data streams that i'm inserting a record with ID 200 while also deleting that record with the same id in the same batch.The records with ID 1 and 5 are deleted but not 200. Merging is not possible in this case of duplication to the delta table. How do i counter this ?

deltaTable = DeltaTable.forPath(spark, 's3://path-to-my-s3')


(deltaTable.alias("first_df").merge(
    updated_data.alias("append_df"),
    "first_df.id = append_df.id") 
  .whenMatchedDelete("append_df.eventName='REMOVE'")
  .whenMatchedUpdateAll("first_df.id = append_df.id")
  .whenNotMatchedInsertAll()
  .execute()
)

Resulting Delta table after Merge - Id with 200 still remains

+----------------+---+---------------+-----------------+---------+
|registrationDate| id|       custName|            email|eventName|
+----------------+---+---------------+-----------------+---------+
|      09-02-2023|  2|     ModPerson2| modemail@mod.com|   MODIFY|
|      17-02-2023|  4|        Person4|person4@gmail.com|   INSERT|
|      09-02-2023|100|      Person100|       p100@p.com|   INSERT|
|      09-02-2023|200|      Person200|       p200@p.com|   INSERT|
|            null|200|           null|             null|   REMOVE|
|      09-02-2023|  3|3modifiedperson|     modp@mod.com|   MODIFY|
+----------------+---+---------------+-----------------+---------+
Mahesh M
  • 69
  • 1
  • 10

0 Answers0