
I have a requirement to join 5 medium-sized tables (~80 GB each) with a large input dataset (~800 GB). All of the data resides in Hive tables. I am using Spark SQL 1.6.1 to do this. The join takes 40 minutes to complete with --num-executors 20 --driver-memory 40g --executor-memory 65g --executor-cores 6. All joins are sort-merge outer joins. I am also seeing a lot of shuffling.

I bucketed all of the tables in Hive into the same number of buckets so that matching keys from every table would land in the same Spark partitions at load time. But it seems Spark does not understand Hive bucketing.
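For reference, each table was bucketed in Hive roughly like this (table/column names and the bucket count are just examples; the same DDL can be issued from Spark through a HiveContext or directly in the Hive CLI):

import org.apache.spark.sql.hive.HiveContext

// sc is an existing SparkContext
val hiveContext = new HiveContext(sc)
hiveContext.sql(
  "CREATE TABLE dim_table_1 (key1 STRING, col1 STRING) " +
  "CLUSTERED BY (key1) SORTED BY (key1 ASC) INTO 256 BUCKETS " +
  "STORED AS ORC")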

Is there any other way I can physically partition and sort the data in Hive (number of part files) so that Spark knows about the partitioning keys while loading the data from Hive and can do the join within the same partitions without shuffling data around? This would avoid an additional repartition after loading the data from Hive.

S. K

1 Answer


First of all, Spark SQL 1.6.1 does not support Hive buckets yet. So in this case we are left with Spark-level operations to ensure that matching keys from all tables go to the same Spark partitions when the data is loaded. The Spark API provides repartition and sortWithinPartitions to achieve this, e.g.

// Repartition df1 by the join key and sort rows within each partition on that key
val part1 = df1.repartition(df1("key1")).sortWithinPartitions(df1("key1"))

In the same way you can generate partitions for the remaining tables, and then join them on the key that was sorted within the partitions.
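For example, a rough sketch of the full flow, assuming a common join key key1 and illustrative table names (big_table, dim_table_1, ...):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("partitioned-join"))
val hiveContext = new HiveContext(sc)

// Load the big table and one dimension table from Hive.
val big  = hiveContext.table("big_table")
val dim1 = hiveContext.table("dim_table_1")

// Repartition each table on the join key and sort within each partition,
// so matching keys land in the same Spark partitions, already sorted.
// Cache the results if they will be reused across several joins.
val bigPart  = big.repartition(big("key1")).sortWithinPartitions(big("key1")).cache()
val dim1Part = dim1.repartition(dim1("key1")).sortWithinPartitions(dim1("key1")).cache()

// Repeat for the remaining tables, then join on the pre-partitioned key.
val joined = bigPart.join(dim1Part, bigPart("key1") === dim1Part("key1"), "outer")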

This makes the join a "shuffle-free" operation, but it comes with a significant computational cost. Caching the DataFrames (calling cache() on the newly repartitioned DataFrames, as in the sketch above) performs better if the join will be run more than once. Hope this helps.

Saurabh