I'm trying to update a Delta Lake table from a Spark DataFrame. What I want is an upsert: update every row in the Delta table that differs from the corresponding row in the DataFrame, and insert every row that is missing from the Delta table.
I tried to do this as follows:
import io.delta.tables._

// Condition that is true when any column differs (treating two NULLs as equal)
val not_equal_string = df.schema.fieldNames.map(fn =>
  s"coalesce(not ((updates.${fn} = history.${fn}) or (isnull(history.${fn}) and isnull(updates.${fn}))), false)"
).reduceLeft((x, y) => s"$x OR $y")

val deltaTable = DeltaTable.forPath(spark, "s3a://sparkdata/delta-table")

deltaTable.as("history").merge(
    df.as("updates"), "updates.EquipmentKey = history.EquipmentKey"
  )
  .whenMatched(not_equal_string).updateAll()
  .whenNotMatched().insertAll()
  .execute()
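For reference, with a hypothetical two-column schema (EquipmentKey, RunningHours), the generated whenMatched condition comes out as:

coalesce(not ((updates.EquipmentKey = history.EquipmentKey) or (isnull(history.EquipmentKey) and isnull(updates.EquipmentKey))), false) OR coalesce(not ((updates.RunningHours = history.RunningHours) or (isnull(history.RunningHours) and isnull(updates.RunningHours))), false)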
This works, but when I look at the resulting Delta table, it has effectively doubled in size even though not a single record was updated. A new JSON commit file was written to the transaction log with a remove action for every old partition file and an add action for a full set of new ones.
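The commit metrics tell the same story. This is a minimal sketch of how I checked them, assuming a Delta version that exposes operationMetrics in the table history:

// Inspect the last commit; operationMetrics includes counters such as
// numTargetRowsUpdated, which show what the merge actually did
deltaTable.history(1)
  .select("operation", "operationMetrics")
  .show(false)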
When I run a plain SQL join with the whenMatched criterion as a WHERE condition, I don't get a single row back.
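Roughly, the check looked like this (a sketch; `history` here is just the Delta table read back as a DataFrame):

import org.apache.spark.sql.functions.expr

// Read the current state of the Delta table as a plain DataFrame
val history = spark.read.format("delta").load("s3a://sparkdata/delta-table")

// Same join key and same whenMatched condition as the merge;
// the count comes back 0, so no row should hit the update clause
df.as("updates")
  .join(history.as("history"), expr("updates.EquipmentKey = history.EquipmentKey"))
  .where(expr(not_equal_string))
  .count()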
I would expect the Delta table to be untouched after such a merge operation. Am I missing something simple?