
I have a table that ingests new data every day through a merge. I'm currently migrating from ORC to the Delta file format, and I stumbled on a problem with the following simple merge operation:

import io.delta.tables.DeltaTable

DeltaTable
  .forPath(sparkSession, deltaTablePath)
  .alias(SOURCE)
  .merge(
    rawDf.alias(RAW),
    joinClause // on primary keys and, when possible, partition keys. Not important here, I guess
  )
  .whenNotMatched().insertAll()
  .whenMatched(dedupConditionUpdate).updateAll()
  .whenMatched(dedupConditionDelete).delete()
  .execute()

When the merge is done, every impacted partition contains hundreds of new files. Since there is one ingestion per day, this behaviour makes every subsequent merge operation slower and slower.

Versions:

  • Spark : 2.4
  • DeltaLake: 0.6.1

Is there a way to repartition before saving, or any other way to improve this?

Ismail H

2 Answers


After searching a bit in Delta's core code, I found an option that repartitions the output before the write:

spark.conf.set("spark.databricks.delta.merge.repartitionBeforeWrite.enabled", "true")
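
For completeness, a minimal sketch of how this could be wired into the job from the question (assuming the same sparkSession, deltaTablePath, and merge as above; the conf must be set before the merge executes):

import io.delta.tables.DeltaTable

// Ask Delta to repartition the merge output by the table's partition
// columns before writing, so each touched partition gets a few larger
// files instead of hundreds of small ones.
sparkSession.conf.set("spark.databricks.delta.merge.repartitionBeforeWrite.enabled", "true")

// The merge from the question then runs unchanged.
DeltaTable
  .forPath(sparkSession, deltaTablePath)
  .alias(SOURCE)
  .merge(rawDf.alias(RAW), joinClause)
  .whenNotMatched().insertAll()
  .whenMatched(dedupConditionUpdate).updateAll()
  .whenMatched(dedupConditionDelete).delete()
  .execute()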
Ismail H

You should set the delta.autoOptimize.autoCompact property on the table for auto compaction.

The following page shows how you can set it for existing and new tables:

https://docs.databricks.com/delta/optimizations/auto-optimize.html
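
As a rough sketch of what that looks like (per the linked Databricks docs; this property takes effect on Databricks runtimes, and my_table is a placeholder name):

// Enable auto compaction on an existing table.
sparkSession.sql(
  "ALTER TABLE my_table SET TBLPROPERTIES (delta.autoOptimize.autoCompact = true)")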

Gaurang Shah
  • Sorry, I hadn't specified the version of Delta I'm using... it's 0.6.1, since it's Spark 2.4. Unfortunately this feature doesn't seem to be available in that version (I've tested it) – Ismail H Jun 16 '22 at 15:29
  • Then the other option is to run compaction manually after your job completes (see the sketch after these comments). – Gaurang Shah Jun 16 '22 at 19:42
  • The thing is, manual compaction takes a lot of time. But I found an option that does this in Delta core's code: `spark.conf.set("spark.databricks.delta.merge.repartitionBeforeWrite.enabled", "true")` – Ismail H Jun 17 '22 at 05:47
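
For reference, a manual compaction pass in open-source Delta 0.6.x could be sketched as follows (based on the file-compaction recipe in the Delta Lake docs; numFiles is a placeholder to tune, and deltaTablePath is the path from the question):

// Rewrite the table into fewer, larger files without changing its data.
val numFiles = 16 // target number of files; tune to your partition sizes

sparkSession.read
  .format("delta")
  .load(deltaTablePath)
  .repartition(numFiles)
  .write
  .option("dataChange", "false") // mark the commit as a rearrangement, not new data
  .format("delta")
  .mode("overwrite")
  .save(deltaTablePath)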