Goal: I want to update an existing column in a Delta Lake table with a periodically run Spark job A, while another periodically run Spark job B adds new data to the same table, without suffering data loss.
Problem: As far as I know, I need to use SaveMode.Overwrite to update existing data. This, however, would result in data loss if another Spark job adds new data in the meantime, because the overwrite job operates on a Delta table version that might be outdated by the time the actual overwrite occurs. In my tests this resulted in data loss for the newly added data.
Is there a way to achieve my goal without using a second table for the updated data? I'm aware of the possibility of using a second, enriched and sanitized table, but would rather operate on one table for now.
Spark job A to update existing data would be similar to:
// Code is in Kotlin, not Scala, but that shouldn't matter
spark
.read()
.format("delta")
.load("path-to-table")
.withColumn("enriched", "enriched_data") // or any other operation altering the state
.write()
.format("delta")
.mode("overwrite")
.save("path-to-table")
Spark job B is a structured streaming job that appends new data to the table.
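For completeness, a minimal sketch of what job B might look like; the streaming source, paths, and checkpoint location below are placeholders and not taken from the actual job:
spark
.readStream()
.format("delta") // assumed: streaming from some Delta source; the real source may differ
.load("path-to-source")
.writeStream()
.format("delta")
.outputMode("append") // job B only appends new rows
.option("checkpointLocation", "path-to-checkpoint")
.start("path-to-table")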