
We are using Delta Lake (delta.io) for our data lake. Every X hours we want to upsert all records that are new or changed.

Our initial code looks like this:

from delta.tables import *

for table in output_tables.keys():
  update_condition = "old." + output_table_unique_keys[table] + " = new." + output_table_unique_keys[table]
  new_df = output_tables[table].drop_duplicates([output_table_unique_keys[table]])
  old_df = DeltaTable.forPath(spark, des_file_path + table)

  old_df.alias("old").merge(new_df.alias("new"), update_condition) \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()

Currently the update_condition is `old.unique_key = new.unique_key`, but we run into errors: we are prone to the ConcurrentAppendException, see https://docs.databricks.com/delta/concurrency-control.html.

The example there shows this code as a solution:

// Target 'deltaTable' is partitioned by date and country
deltaTable.as("t").merge(
    source.as("s"),
    "s.user_id = t.user_id AND s.date = t.date AND s.country = t.country")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()

So in order to implement the solution we need to make our update_condition more explicit. We want to do this by adding the following statement (our partition columns):

old.unique_key = new.unique_key AND new.Division = <Division> AND new.company = <company>

Just like in the Scala code in the example (from the link), the values of our new statement are in `<>`. How does this work in Python? How can I dynamically replace that value with the value of each corresponding row?
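For illustration, one way to build such a condition string dynamically in Python (a sketch: `build_merge_condition` is a hypothetical helper, and in a real job the `Division`/`company` literals would come from the distinct partition values actually present in the source batch, e.g. via `new_df.select("Division").distinct().collect()`):

```python
def build_merge_condition(unique_key, partition_values):
    """Build a SQL-formatted merge condition string.

    partition_values maps a partition column name to the literal value
    present in the current source batch. Hypothetical helper; the column
    names and values below are placeholders.
    """
    clauses = [f"old.{unique_key} = new.{unique_key}"]
    for col, value in partition_values.items():
        # Note the single quotes so the value is a SQL string literal.
        clauses.append(f"new.{col} = '{value}'")
    return " AND ".join(clauses)

condition = build_merge_condition("id", {"Division": "EU", "company": "Acme"})
print(condition)
# old.id = new.id AND new.Division = 'EU' AND new.company = 'Acme'
```

The resulting string can then be passed as the merge condition in place of `update_condition` in the loop above.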

  • The Python code works similarly to how the Scala code works; some good references for this are [Simple, Reliable Upserts and Deletes on Delta Lake Tables using Python APIs](https://databricks.com/blog/2019/10/03/simple-reliable-upserts-and-deletes-on-delta-lake-tables-using-python-apis.html) and [Upsert into a table using merge](https://docs.delta.io/0.8.0/delta-update.html#upsert-into-a-table-using-merge). In terms of dynamically changing the values, you could run multiple statements and loop through all the values, but would it be possible to generate a hash of the keys to avoid this? – Denny Lee May 12 '21 at 04:40
  • @DennyLee Looping will take too long since it's a lot of data. Since I couldn't find a fix (not even in delta repository) I ended up with using the same method `old.X = new.X AND old.y = new.y`, but this time adding a retry mechanism whenever it encounters a `ConcurrentAppendException`. – 54m May 12 '21 at 13:16
  • Is it possible that the `target` table contains multiple updates to the same row thus causing the `ConcurrentAppendException`? – Denny Lee May 12 '21 at 16:16

1 Answer


This solution could be extrapolated to your situation.

First I created a date variable:

from datetime import datetime
from dateutil import relativedelta

# Update active values: the first day of the current month
active_date = datetime.today() + relativedelta.relativedelta(months=0, day=1)
active_date = active_date.strftime('%Y-%m-%d')
print(active_date)

I passed this variable as a condition to update my Delta table using the following code.

from delta.tables import *

deltaTable = DeltaTable.forPath(spark, '/tmp/delta/people')

# Declare the predicate by using a SQL-formatted string.
# Note the single quotes around the date, so it is compared as a
# string literal rather than a bare identifier.
deltaTable.update(
  condition = "startDate != '" + active_date + "'",
  set = { "gender": "'Female'" }
)

As you can see, the condition can be rebuilt from the current date on every run, while the `set` clause alters another column, in this case the gender column.
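The retry mechanism mentioned in the comments under the question could be sketched as follows (hypothetical helper, not part of the Delta API; in a real Spark job you would catch `delta.exceptions.ConcurrentAppendException` directly instead of inspecting the exception class name):

```python
import time

def merge_with_retry(run_merge, max_attempts=3, backoff_seconds=1.0):
    """Run a zero-argument merge callable, retrying on concurrent-append
    conflicts with a linear backoff. Hypothetical sketch."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_merge()
        except Exception as exc:
            # In Spark you would catch ConcurrentAppendException directly.
            if "ConcurrentAppend" not in type(exc).__name__ or attempt == max_attempts:
                raise
            time.sleep(backoff_seconds * attempt)

# Tiny self-contained demo with a stand-in exception class:
class ConcurrentAppendException(Exception):
    pass

attempts = {"n": 0}
def flaky_merge():
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise ConcurrentAppendException("write conflict")
    return "merged"

result = merge_with_retry(flaky_merge, backoff_seconds=0.0)
print(result)
# merged
```

In the real job, `run_merge` would be a closure that performs the `old_df.alias("old").merge(...)...execute()` call from the question.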