
I am running everything in Databricks, and all of the data is handled as PySpark DataFrames.

The scenario: I read 40 Delta files from ADLS, apply a series of transformation functions to each one (in a FIFO loop), and finally write the results back to ADLS as Delta files:

df.write.format("delta").mode('append').save(...)

Each file is about 10k rows, and the whole process takes about 1 hour.
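To illustrate, the flow is roughly the sketch below; the ADLS paths and transformation functions are placeholders, not the actual code (spark is the SparkSession provided by the Databricks notebook).

    from pyspark.sql import DataFrame

    def transform_a(df: DataFrame) -> DataFrame:
        # placeholder transformation; real logic not shown here
        return df

    def transform_b(df: DataFrame) -> DataFrame:
        # placeholder transformation; real logic not shown here
        return df

    transformations = [transform_a, transform_b]  # applied in FIFO order

    input_paths = [f"abfss://container@account.dfs.core.windows.net/input/file_{i}" for i in range(40)]
    output_path = "abfss://container@account.dfs.core.windows.net/output/"

    for path in input_paths:
        df = spark.read.format("delta").load(path)  # ~10k rows per file
        for fn in transformations:                  # FIFO loop of transformations
            df = fn(df)
        df.write.format("delta").mode("append").save(output_path)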

I am curious if anyone can answer the questions below:

  1. Is a loop a good approach for applying these transformations? Is there a better way to apply them to all files in parallel?
  2. What is a typical average time to load a Delta table for a file with 10k rows?
  3. Any suggestions for improving performance?
  • 1. Can you show your transformation code? What is the spec of your cluster's worker type? 2. I am working with nearly 100 million records without any performance issues (it takes about a few minutes to load and write), so for me this looks like a problem with the transformations or the infrastructure. 3. You may tune your transformation logic or use a higher cluster spec. – Phuri Chalermkiatsakul Apr 29 '22 at 06:54
  • I am appending 200k records per second to a Delta table and have no problem. Make sure you run OPTIMIZE with VACUUM on your table. – Dariusz Krynicki Oct 15 '22 at 20:35
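
For reference, the OPTIMIZE and VACUUM maintenance mentioned in the comment above can be run from a notebook roughly like this; the table path is a placeholder.

    # Compact small files and clean up files no longer referenced; path is a placeholder.
    output_path = "abfss://container@account.dfs.core.windows.net/output/"

    spark.sql(f"OPTIMIZE delta.`{output_path}`")  # bin-pack small files into larger ones
    spark.sql(f"VACUUM delta.`{output_path}`")    # remove unreferenced files (default 7-day retention)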

1 Answer


You said you run everything in Databricks. Assuming you are using the latest version of Delta, try the following (a sketch of points 1-7 follows the list):

  1. Set delta.autoCompact.
  2. Set shuffle partitions to auto.
  3. Set delta.deletedFileRetentionDuration.
  4. Set delta.logRetentionDuration.
  5. When you write the DataFrame, use partitionBy.
  6. When you write the DataFrame, you may want to repartition, but you don't have to.
  7. You may want to set maxRecordsPerFile in your writer options.
  8. Show us the code, as it seems like your processing code is the bottleneck.
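
A sketch of points 1-7, assuming Databricks; the path, partition column, interval values, and maxRecordsPerFile value are placeholders to adapt, not recommendations.

    # 1-2: session-level settings ('auto' is supported on Databricks with AQE enabled)
    spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
    spark.conf.set("spark.sql.shuffle.partitions", "auto")

    # 3-4: table properties; the intervals here are just examples
    output_path = "abfss://container@account.dfs.core.windows.net/output/"
    spark.sql(f"""
        ALTER TABLE delta.`{output_path}` SET TBLPROPERTIES (
            'delta.deletedFileRetentionDuration' = 'interval 7 days',
            'delta.logRetentionDuration' = 'interval 30 days'
        )
    """)

    # 5-7: writer-side options; df is the transformed DataFrame,
    # 'event_date' is a hypothetical partition column
    (df.repartition("event_date")              # optional, see point 6
       .write.format("delta")
       .mode("append")
       .option("maxRecordsPerFile", 100000)
       .partitionBy("event_date")
       .save(output_path))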
Dariusz Krynicki