We are using Delta Lake (delta.io) for our data lake. Every X hours we want to upsert all records that are new or changed.
Our initial code looks like this:
from delta.tables import *

for table in output_tables.keys():
    key = output_table_unique_keys[table]
    update_condition = "old." + key + " = new." + key
    # Deduplicate the incoming batch on its unique key before merging
    new_df = output_tables[table].drop_duplicates([key])
    old_df = DeltaTable.forPath(spark, des_file_path + table)
    old_df.alias("old").merge(new_df.alias("new"), update_condition) \
        .whenMatchedUpdateAll() \
        .whenNotMatchedInsertAll() \
        .execute()
Currently the update_condition is old.unique_key = new.unique_key, but we run into errors: we are prone to ConcurrentAppendException, see https://docs.databricks.com/delta/concurrency-control.html.
The linked page shows this code as a solution:
// Target 'deltaTable' is partitioned by date and country
deltaTable.as("t").merge(
    source.as("s"),
    "s.user_id = t.user_id AND s.date = t.date AND s.country = t.country")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
So in order to implement this solution we need to make our update_condition more explicit. We want to do this by adding our partition columns to the condition:

old.unique_key = new.unique_key AND new.Division = <Division> AND new.company = <company>

Just like in the Scala code from the example, the values we need to fill in are marked with <>. How does this work in Python? How can we dynamically replace each placeholder with the value from the corresponding row?
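For illustration, here is a minimal sketch of what we are trying to express. It assumes Division and company are the partition column names and builds the condition from the distinct partition values present in the incoming batch; the IN-list idea is our own reading of the docs (which constrain the target side, t./old., of the condition), not something taken from the linked example:

partition_cols = ["Division", "company"]  # assumed partition column names
key = output_table_unique_keys[table]
predicates = ["old." + key + " = new." + key]
for col in partition_cols:
    # Collect the distinct values for this partition column in the new batch;
    # this assumes each batch only touches a small number of partitions.
    values = [row[0] for row in new_df.select(col).distinct().collect()]
    in_list = ", ".join("'" + str(v) + "'" for v in values)
    # Constrain the target (old) side so only those partitions are considered
    predicates.append("old." + col + " IN (" + in_list + ")")
update_condition = " AND ".join(predicates)

Is building an IN list over all partition values in the batch like this a reasonable approach, or is there a way to substitute the value per row?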