
I have a table whose primary key is composed of multiple columns, so I need to perform the merge logic on multiple columns:


import io.delta.tables.DeltaTable

// Upsert on the composite primary key (column1 .. column5)
DeltaTable.forPath(spark, "path")
  .as("data")
  .merge(
    finalDf1.as("updates"),
    "data.column1 = updates.column1 AND data.column2 = updates.column2 AND data.column3 = updates.column3 AND data.column4 = updates.column4 AND data.column5 = updates.column5")
  .whenMatched
  .updateAll()
  .whenNotMatched
  .insertAll()
  .execute()

When I check the row counts afterwards, the table is not updated as expected.

Could someone help me with this?

  • Can you take one of the input rows that should match and run `select * from table condition_on_all_primary_key` for some of the input rows? Do you see them as new rows? Also, what is the DBR version? – Alex Ott Aug 31 '21 at 09:58
  • Hi, the duplication is occurring only for 20 to 30 rows; i.e., I'm getting 5,320 rows from the merge code above, but the actual count is 5,302 when counting on the unique fields. – Tony Sep 03 '21 at 06:40
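
For reference, a minimal sketch of the check suggested in the comments, assuming the table is registered as `my_table`; the table name and the key values are placeholders:

// Hypothetical check: look up one source row in the target table by the
// full composite key. If it comes back empty but the merge inserted it
// anyway, the key values differ in some subtle way (nulls, whitespace, case).
spark.sql("""
  SELECT * FROM my_table
  WHERE column1 = 'v1' AND column2 = 'v2' AND column3 = 'v3'
    AND column4 = 'v4' AND column5 = 'v5'
""").show()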

2 Answers


Please also try an approach like the one in this example: https://docs.databricks.com/_static/notebooks/merge-in-cdc.html. Create a changes table with additional columns which note whether a row is:

  • new (to be inserted)
  • old (the primary key exists) and nothing has changed
  • old (the primary key exists) but other fields need an update

and then use additional conditions on the merge, for example:
.whenMatched("s.new = true")
.insert()
.whenMatched("s.updated = true")
.updateExpr(Map("key" -> "s.key", "value" -> "s.newValue"))
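
As a minimal, hypothetical sketch of how such flag columns could be derived (`incomingDf`, the table path, and the column names are placeholders; real logic for `updated` would also compare the non-key fields):

import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions._

// Left-join incoming rows against the current table state on the composite key.
val keyCols = Seq("column1", "column2", "column3", "column4", "column5")
val currentKeys = DeltaTable.forPath(spark, "path").toDF()
  .select(keyCols.map(col): _*)
  .withColumn("exists", lit(true))

val changesDf = incomingDf
  .join(currentKeys, keyCols, "left_outer")
  .withColumn("new", col("exists").isNull)        // key absent: row should be inserted
  .withColumn("updated", col("exists").isNotNull) // key present: row may need an update
  .drop("exists")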

How are you counting your rows?

One thing to keep in mind is that reading and counting the parquet files produced by Delta Lake directly will potentially give you a different result than reading the rows through the Delta table interface. Remember that Delta keeps a transaction log and supports time travel, so it stores copies of rows as they change over time.

Here's a way to accurately count the current rows in a delta table:

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "<path to your delta table>")
deltaTable.toDF().count()
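
For contrast, a hedged sketch (in Scala) of why the two counts can differ; `<path>` stands for the same table path. Reading the files as plain parquet ignores the transaction log and may count superseded file versions, while reading through the delta format returns only the current snapshot:

// May overcount: a plain parquet read also sees old file versions kept for time travel.
val rawParquetCount = spark.read.format("parquet").load("<path>").count()

// Respects the Delta transaction log: counts only the current table snapshot.
val deltaCount = spark.read.format("delta").load("<path>").count()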