
Goal: I want to update an existing column in a Delta Lake table with a periodically run Spark job A, while another Spark job B keeps adding new data to the same table, without suffering data loss.

Problem: As far as I know, I need to use SaveMode.Overwrite to update existing data. However, this results in data loss if another Spark job adds new data in the meantime, because the overwrite job operates on a Delta table version that might be outdated by the time the actual overwrite occurs. In my tests this did result in loss of the newly added data.

Is there a way to achieve my goal without using a second table for the updated data? I'm aware of the possibility of using a second, enriched and sanitized table, but would rather operate on one table for now.

Spark job A to update existing data would be similar to:

// Code is in Kotlin not Scala, but that shouldn't matter
spark
        .read()
        .format("delta")
        .load("path-to-table")
        .withColumn("enriched", "enriched_data") // or any other operation altering the state
        .write()
        .format("delta")
        .mode("overwrite")
        .save("path-to-table")

Spark job B is a structured streaming job that appends new data to the table.
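
For context, job B is roughly of the following shape (again Kotlin; the "rate" source and the checkpoint path are placeholders standing in for the real source and location):

// Sketch of job B: a streaming query that only appends to the same Delta table
spark
        .readStream()
        .format("rate") // placeholder for the real source of new data
        .load()
        .writeStream()
        .format("delta")
        .outputMode("append") // append-only: existing data files are not rewritten
        .option("checkpointLocation", "path-to-checkpoint")
        .start("path-to-table")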

Daniel Müller
  • You can use [`MERGE`](https://docs.databricks.com/delta/delta-update.html#upsert-into-a-table-using-merge) for the job that updates data, and write using `append` mode for the job that appends new data. – blackbishop Mar 01 '21 at 09:50
  • Thanks, that was exactly what I was looking for. – Daniel Müller Mar 01 '21 at 10:54
  • You need to make sure that the two jobs are accessing different partitions, otherwise you will get conflicts, see https://docs.databricks.com/delta/concurrency-control.html – hbrgnr Mar 16 '21 at 17:07
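
Following the MERGE suggestion in the first comment, a minimal sketch of what job A could look like instead of the overwrite (Kotlin, using the Delta Lake DeltaTable API; the join key "id" and the enrichment expression are illustrative assumptions, not part of the original question):

// Sketch of job A rewritten as a MERGE; requires the delta-core dependency
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.lit

val target = DeltaTable.forPath(spark, "path-to-table")

// Derive the enriched values from the current snapshot of the table
val updates = target.toDF()
        .withColumn("enriched", lit("enriched_data"))

target.alias("t")
        .merge(updates.alias("u"), "t.id = u.id") // "id" is an assumed unique key
        .whenMatched()
        .updateExpr(mapOf("enriched" to "u.enriched")) // only the enriched column is rewritten
        .execute()

As the second comment points out, a MERGE can still conflict with concurrent appends; keeping the two jobs on disjoint partitions (see the concurrency-control docs linked above) is what avoids those exceptions.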
