
I have 3 input files:

- File1 - 27 GB
- File2 - 3 GB
- File3 - 12 MB

My cluster configuration:

- 2 executors
- 2 cores per executor
- Executor memory: 13 GB (plus 2 GB overhead)

The transformation I'm going to perform is a left join, in which the left table is file1 and the right tables are file2 and file3.

I need to repartition file1 and file2 to an optimal number of partitions so that time/resources aren't wasted.

Thanks in advance

1 Answer


You don't mention any other transformations, so I'm assuming you want to create a very simple job that performs only this one join.

You aren't asking about file3, so I'm assuming you are going to broadcast it with a hint, which is a good direction.
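One reason the explicit hint matters: file3 at 12 MB sits just above Spark's default `spark.sql.autoBroadcastJoinThreshold` of 10 MB, so Spark would not broadcast it automatically. A minimal sketch of that check (the DataFrame names in the comment are hypothetical):

```python
# file3 is 12 MB; Spark's default spark.sql.autoBroadcastJoinThreshold
# is 10 MB (10485760 bytes), so without an explicit hint Spark would
# NOT pick a broadcast join for it on its own.
default_threshold_mb = 10
file3_mb = 12
needs_hint = file3_mb > default_threshold_mb
print(needs_hint)  # True: add the hint explicitly

# Hypothetical PySpark usage (not run here; file1_df/file3_df are
# placeholder names for your DataFrames):
#   from pyspark.sql.functions import broadcast
#   joined = file1_df.join(broadcast(file3_df), "key", "left")
```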

If you aren't doing anything before this join, I'm not sure it's worth repartitioning file1/file2, because they will most probably be joined with an SMJ (sort-merge join, which shuffles both datasets based on the column from the join condition), and the output DataFrame of this join will have a number of partitions equal to spark.sql.shuffle.partitions. So you may try to tune that parameter instead (this will also affect other shuffles, so keep in mind my assumption from the first line).
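To see where the default already lands, here is a back-of-the-envelope sketch (assuming the shuffled data is roughly the size of file1, which ignores compression and serialization overhead):

```python
# With the default spark.sql.shuffle.partitions = 200, shuffling
# file1's ~27 GB gives partitions of roughly this size:
file1_gb = 27
default_shuffle_partitions = 200
partition_mb = file1_gb * 1024 / default_shuffle_partitions
print(round(partition_mb))  # ~138 MB per partition
```

So the default of 200 already puts file1's partitions inside the commonly recommended 100-200 MB range; the real on-disk vs. in-shuffle sizes will differ, so treat this as a starting point to verify in the Spark UI.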

You may try to adjust this parameter to the bigger dataset (file1) so that partitions end up around 100-200 MB each. I think this blog post is worth reading: Medium blog post
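A sketch of how you might pick a concrete value, assuming a 150 MB target (the middle of that range) and rounding up to a multiple of the total core count so no core sits idle in the last wave:

```python
import math

# Hypothetical sizing for this cluster: 2 executors x 2 cores.
file1_mb = 27 * 1024          # ~27 GB driving table
target_partition_mb = 150     # aim for the middle of 100-200 MB
total_cores = 2 * 2

raw = math.ceil(file1_mb / target_partition_mb)          # 185
partitions = math.ceil(raw / total_cores) * total_cores  # 188
print(partitions)

# Then, before the join:
#   spark.conf.set("spark.sql.shuffle.partitions", partitions)
```

With only 4 cores total, 188 partitions means many waves of tasks; that's fine for a one-off batch job, and smaller partitions also reduce the risk of spilling with 13 GB executors.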

M_S