
I have a table whose primary key is composed of multiple columns, so I need to perform the merge logic on multiple columns:


import io.delta.tables.DeltaTable

// Upsert on the composite primary key (column1 .. column5)
DeltaTable.forPath(spark, "path")
  .as("data")
  .merge(
    finalDf1.as("updates"),
    "data.column1 = updates.column1 AND data.column2 = updates.column2 AND data.column3 = updates.column3 AND data.column4 = updates.column4 AND data.column5 = updates.column5")
  .whenMatched
  .updateAll()
  .whenNotMatched
  .insertAll()
  .execute()

When I check the row counts afterwards, the table is not updated as expected.

Could someone help me with this?

  • Can you take one of the input rows that should match and run `select * from table condition_on_all_primary_key` for some of the input rows? Do you see them as new rows? Also, what is the DBR version? – Alex Ott Aug 31 '21 at 09:58
  • Hi, the duplication is occurring only for 20 to 30 rows; i.e., I'm getting 5,320 rows from the merge code above, but the actual count is 5,302 when counting on the unique fields. – Tony Sep 03 '21 at 06:40
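
For reference, a minimal sketch of the check suggested in the comments, assuming the table is registered as `my_table`; the table name and the key values are placeholders:

// Hypothetical check: look up one source row in the target table by the
// full composite key. If it comes back empty but the merge inserted it
// anyway, the key values differ in some subtle way (nulls, whitespace, case).
spark.sql("""
  SELECT * FROM my_table
  WHERE column1 = 'v1' AND column2 = 'v2' AND column3 = 'v3'
    AND column4 = 'v4' AND column5 = 'v5'
""").show()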

2 Answers


Please also try an approach like the one in this example: https://docs.databricks.com/_static/notebooks/merge-in-cdc.html. Create a changes table with additional columns which note whether a row is:

  • new (to be inserted)
  • old (the primary key exists) and nothing has changed
  • old (the primary key exists) but other fields need an update

and then use additional conditions on the merge, for example:
.whenMatched("s.new = true")
.insert()
.whenMatched("s.updated = true")
.updateExpr(Map("key" -> "s.key", "value" -> "s.newValue"))
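
As a minimal, hypothetical sketch of how such flag columns could be derived (`incomingDf`, the table path, and the column names are placeholders; real logic for `updated` would also compare the non-key fields):

import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions._

// Left-join incoming rows against the current table state on the composite key.
val keyCols = Seq("column1", "column2", "column3", "column4", "column5")
val currentKeys = DeltaTable.forPath(spark, "path").toDF()
  .select(keyCols.map(col): _*)
  .withColumn("exists", lit(true))

val changesDf = incomingDf
  .join(currentKeys, keyCols, "left_outer")
  .withColumn("new", col("exists").isNull)        // key absent: row should be inserted
  .withColumn("updated", col("exists").isNotNull) // key present: row may need an update
  .drop("exists")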

How are you counting your rows?

One thing to keep in mind is that reading and counting the parquet files produced by Delta Lake directly will potentially give you a different result than reading the rows through the Delta table interface. Remember that Delta keeps a transaction log and supports time travel, so it stores copies of rows as they change over time.

Here's a way to accurately count the current rows in a delta table:

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "<path to your delta table>")
deltaTable.toDF().count()
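
For contrast, a hedged sketch (in Scala) of why the two counts can differ; `<path>` stands for the same table path. Reading the files as plain parquet ignores the transaction log and may count superseded file versions, while reading through the delta format returns only the current snapshot:

// May overcount: a plain parquet read also sees old file versions kept for time travel.
val rawParquetCount = spark.read.format("parquet").load("<path>").count()

// Respects the Delta transaction log: counts only the current table snapshot.
val deltaCount = spark.read.format("delta").load("<path>").count()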