I'm trying to update a Delta Lake table from a Spark DataFrame. What I want is an upsert: update every row in the Delta table that differs from the corresponding row in the DataFrame, and insert every row that is missing from the Delta table.
I tried to do this as follows:
import io.delta.tables._

// Condition that is true when any column differs (treating two NULLs as equal)
val not_equal_string = df.schema.fieldNames.map(fn =>
  s"coalesce(not ((updates.${fn} = history.${fn}) or (isnull(history.${fn}) and isnull(updates.${fn}))), false)"
).reduceLeft((x, y) => s"$x OR $y")

val deltaTable = DeltaTable.forPath(spark, "s3a://sparkdata/delta-table")

deltaTable.as("history").merge(
    df.as("updates"), "updates.EquipmentKey = history.EquipmentKey"
  )
  .whenMatched(not_equal_string).updateAll()
  .whenNotMatched().insertAll()
  .execute()
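For reference, with a hypothetical two-column schema (EquipmentKey, RunningHours), the generated whenMatched condition comes out as:

coalesce(not ((updates.EquipmentKey = history.EquipmentKey) or (isnull(history.EquipmentKey) and isnull(updates.EquipmentKey))), false) OR coalesce(not ((updates.RunningHours = history.RunningHours) or (isnull(history.RunningHours) and isnull(updates.RunningHours))), false)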
This works, but when I look at the resulting Delta table, it has effectively doubled in size even though not a single record was updated. A new JSON commit file was written to the transaction log with a remove action for every old partition file and an add action for a full set of new ones.
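The commit metrics tell the same story. This is a minimal sketch of how I checked them, assuming a Delta version that exposes operationMetrics in the table history:

// Inspect the last commit; operationMetrics includes counters such as
// numTargetRowsUpdated, which show what the merge actually did
deltaTable.history(1)
  .select("operation", "operationMetrics")
  .show(false)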
When I run a plain SQL join with the whenMatched criterion as a WHERE condition, I don't get a single row back.
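Roughly, the check looked like this (a sketch; `history` here is just the Delta table read back as a DataFrame):

import org.apache.spark.sql.functions.expr

// Read the current state of the Delta table as a plain DataFrame
val history = spark.read.format("delta").load("s3a://sparkdata/delta-table")

// Same join key and same whenMatched condition as the merge;
// the count comes back 0, so no row should hit the update clause
df.as("updates")
  .join(history.as("history"), expr("updates.EquipmentKey = history.EquipmentKey"))
  .where(expr(not_equal_string))
  .count()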
I would expect the Delta table to be untouched after such a merge operation. Am I missing something simple?