I am running everything in Databricks, and everything below assumes the data is a PySpark DataFrame.
The scenario is:
I read 40 files as Delta files from ADLS, then apply a series of transformation functions to each one in a loop (FIFO order). Finally, I write the result back to ADLS as Delta files:
df.write.format("delta").mode('append').save(...)
Each file is about 10k rows, and the whole process takes about 1 hour.
I am curious if anyone can answer the questions below:
- Is a loop a good approach for applying those transformations? Is there a better way to apply those functions to all files at once, in parallel (see the sketch after this list for what I have in mind)?
- What is a typical average time to load a Delta table for a file with 10k rows?
- Any suggestions for improving the performance?
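To illustrate what I mean by "all files at once" in the first bullet: something like submitting the per-file pipelines from multiple driver threads so Spark can schedule the jobs concurrently. Again just a sketch, reusing the placeholder names from above; I don't know if this is actually the right approach.

from concurrent.futures import ThreadPoolExecutor

def process_one(path: str) -> None:
    # same per-file pipeline as the loop above, but each call runs as its own
    # Spark job; append-only Delta writes normally don't conflict with each other
    df = spark.read.format("delta").load(path)
    df = apply_transformations(df)
    df.write.format("delta").mode("append").save(target_path)

# run several files' pipelines concurrently from the driver
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(process_one, source_paths))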