I eventually took the following approach:
- Backup the original table to be on the safe-side
- spark.read.table into memory
- df.write.option(“overwriteSchema”, “true”) to the original location
I chose this approach so I don’t need to change the original data location.
In more details:
1. Backup the original table to be on the safe-side
Since this was in on Databricks I could use their proprietary deep clone
feature:
create table mydb.mytable_backup_before_schema_migration_v1
deep clone mydb.mytable
location 'dbfs:/mnt/defaultDatalake/backups/zones/mydb/mytable_backup_before_schema_migration_v1'
If you are not in Databricks and don't have access to its deep clone
, you still can backup the table by reading and writing a copy to another place.
2. read and 3. overwrite with new schema
val df = spark.read.format("delta").table("mydb.mytable")
df
.write
.format("delta")
.mode("overwrite")
.partitionBy("part_a", "part_b")
.option("overwriteSchema", "true")
.saveAsTable("mydb.mytable") // same table, same location, but different data physical organization because partition changes