
I'm using Delta Lake 0.4.0 with merge, like:

target.alias("t")
  .merge(
    src.as("s"),
    "s.id = t.id"
  )
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()

src reads from a folder containing thousands of files, and the merge output produces many small files as well. Is there a way to control the number of files in the merge result, similar to the effect of repartition(1) or coalesce(1)?

Thanks

– processadd
2 Answers


There isn't a way to control the number of files in a Delta output operation. Instead, use OPTIMIZE at the appropriate time or, on platforms such as Databricks, take advantage of auto-optimization.
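For example, on Databricks, compaction can be done with a scheduled OPTIMIZE or by enabling per-table auto-optimization. A minimal sketch, assuming a Databricks runtime (these commands were not part of open source Delta Lake 0.4.0, and the table path is illustrative):

    // Compact the table's small files into larger ones.
    spark.sql("OPTIMIZE delta.`/data/events`")

    // Or enable auto-optimization so future writes are compacted
    // automatically:
    spark.sql("""
      ALTER TABLE delta.`/data/events`
      SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
      )
    """)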

– Sim
  • hmm, we are using open source Delta Lake, so we cannot use that, right? We load small real-time data files into Delta Lake and want to avoid a separate compaction pass. Maybe we have to split the merge into an update and an insert ourselves. – processadd Nov 19 '19 at 17:42
  • If you split it, you'd lose atomicity, which is the key feature of Delta. I'd strongly recommend against it. You can run OPTIMIZE on a schedule. – Sim Nov 19 '19 at 18:36
  • Thanks. I thought about the atomicity. Say we do the update first and then the insert. If the update succeeds but the insert fails, the next run will do the update and the insert again for the same data, so it should be idempotent (we need to fully test this). The thing is, with merge, 500 files (around 1GB) generate many small files of ~4MB. Not sure OPTIMIZE works for open source Delta Lake. – processadd Nov 19 '19 at 19:53
  • The issue with atomicity is that between the update and the insert the table is in an inconsistent state and, unless you take explicit steps to prevent it (via the nature of how you solve the problem, a workflow server, or a cross-cluster locking system), an operation can observe that inconsistent state. In other words, without atomicity, idempotency is only guaranteed under a specific order of operations, which you have to ensure yourself. With atomicity, you don't have to worry about the order of operations. (Idempotency is always the way to go.) – Sim Nov 20 '19 at 20:23
  • Hi Sim, the auto-optimization you mentioned is not in open source Delta Lake yet, right? I tried adding spark.databricks.delta.optimizeWrite.enabled and spark.databricks.delta.autoCompact.enabled, but they don't seem to take effect. – processadd Nov 25 '19 at 18:21
  • Yup, saw that. If you can pause stream writes, you can rewrite the table using `df.repartition()` and Delta's ACID I/O. That's manual compaction; see the sketch after these comments. – Sim Nov 27 '19 at 08:33
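A minimal sketch of the manual compaction described in the last comment, assuming writes to the table are paused (the path and target file count are illustrative):

    // Read the current snapshot, repartition to a small number of files,
    // and atomically overwrite the table in place. Delta resolves the
    // snapshot at read time, so reading and overwriting the same path works.
    val path = "/data/events"

    spark.read
      .format("delta")
      .load(path)
      .repartition(16)            // target file count; tune for table size
      .write
      .format("delta")
      .mode("overwrite")
      .save(path)

Later Delta Lake releases also accept .option("dataChange", "false") on the write, so downstream streaming readers don't treat the rewrite as new data.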

According to https://docs.delta.io/latest/delta-update.html#performance-tuning, you can now set spark.delta.merge.repartitionBeforeWrite to true to avoid this.
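A minimal sketch of enabling it, assuming a Delta Lake release that includes the setting (it can also be passed with --conf at submit time):

    // Repartition merge output by the table's partition columns before
    // writing, reducing the number of small files produced.
    spark.conf.set("spark.delta.merge.repartitionBeforeWrite", "true")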