I have a table that ingests new data every day through a merge. I'm currently migrating from the ORC to the Delta file format, and I stumbled upon a problem while running the following simple merge operation:
import io.delta.tables.DeltaTable

DeltaTable
  .forPath(sparkSession, deltaTablePath)
  .alias(SOURCE)
  .merge(
    rawDf.alias(RAW),
    joinClause // using primary keys and, when possible, partition keys. Not important here I guess
  )
  .whenNotMatched().insertAll()
  .whenMatched(dedupConditionUpdate).updateAll()
  .whenMatched(dedupConditionDelete).delete()
  .execute()
When the merge completes, every impacted partition ends up with hundreds of new files. Since there is one ingestion per day, each subsequent merge operation gets slower and slower.
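The only workaround I can think of so far is an explicit compaction pass after each merge, along the lines of the bin-packing example from the Delta Lake docs; the sketch below assumes a partition column named date, and the filter value and target file count are placeholders for my actual layout. The downside is that every impacted partition gets rewritten a second time:

// Sketch of a post-merge compaction pass (not part of my current job).
// "date", the filter value and numFiles are placeholders.
val numFiles = 16
sparkSession.read
  .format("delta")
  .load(deltaTablePath)
  .where("date = '2020-01-01'")                   // restrict to one impacted partition
  .repartition(numFiles)                          // collapse into a handful of files
  .write
  .format("delta")
  .mode("overwrite")
  .option("dataChange", "false")                  // rewrite layout without marking new data
  .option("replaceWhere", "date = '2020-01-01'")  // only replace that partition
  .save(deltaTablePath)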
Versions:
- Spark: 2.4
- Delta Lake: 0.6.1
Is there a way to repartition before saving, or any other way to improve this?
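For what it's worth, the only knob I've found so far is lowering the shuffle parallelism before the merge, on the assumption (which I haven't confirmed) that the merge's shuffle determines how many files get written per partition:

// Assumption: in Delta 0.6.1 the number of files written by MERGE follows the
// shuffle parallelism of the rewrite, so lowering it may reduce the file count.
sparkSession.conf.set("spark.sql.shuffle.partitions", "20")
// ... then run the merge shown above ...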