I have a requirement to join five medium-sized tables (~80 GB each) with a large input dataset of ~800 GB. All of the data resides in Hive tables.
I am using Spark SQL 1.6.1 to do this.
The join takes about 40 minutes to complete with
--num-executors 20 --driver-memory 40g --executor-memory 65g --executor-cores 6
All of the joins are planned as sort-merge outer joins, and I am also seeing a lot of shuffle.
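Roughly what the job does today (a sketch; `big_input`, `t1`..`t5`, `join_key` and the projected columns are placeholders for the real names):

```scala
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)  // sc: SparkContext from spark-shell/spark-submit

// One query joining the ~800 GB input against the five ~80 GB tables;
// Spark 1.6 plans each of these as a sort-merge outer join and
// shuffles every table on join_key first.
val result = sqlContext.sql(
  """SELECT b.*, t1.c1, t2.c2, t3.c3, t4.c4, t5.c5
    |FROM big_input b
    |LEFT OUTER JOIN t1 ON b.join_key = t1.join_key
    |LEFT OUTER JOIN t2 ON b.join_key = t2.join_key
    |LEFT OUTER JOIN t3 ON b.join_key = t3.join_key
    |LEFT OUTER JOIN t4 ON b.join_key = t4.join_key
    |LEFT OUTER JOIN t5 ON b.join_key = t5.join_key""".stripMargin)
```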
I bucketed all of the tables in Hive on the join key, into the same number of buckets, so that matching keys from every table would land in the same Spark partitions at load time. But Spark 1.6 does not seem to understand Hive bucketing.
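To show what I mean (a sketch, reusing the HiveContext and placeholder names from above): even though both sides are bucketed and sorted on the join key in Hive, with the same bucket count, the physical plan still inserts an Exchange (a shuffle) on both sides of the SortMergeJoin:

```scala
val big = sqlContext.table("big_input")  // bucketed + sorted on join_key in Hive
val t1  = sqlContext.table("t1")         // same bucket count

big.join(t1, Seq("join_key"), "left_outer")
   .explain()  // plan still shows "Exchange hashpartitioning(join_key, ...)" for both inputs
```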
Is there any other way to physically partition and sort the data in Hive (e.g. by controlling the number of part files) so that Spark knows about the partitioning keys when it loads the data, and can perform the joins within the existing partitioning without shuffling data around? That would avoid the additional repartitioning after loading from Hive. For reference, the in-Spark workaround I am trying to avoid is sketched below.
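The only workaround I can see inside Spark itself (again with placeholder names): repartition every table on the join key with a matching partition count once, up front, which as far as I understand lets the planner reuse that partitioning across the joins instead of shuffling before each one. But this still pays a full shuffle per table after loading, which is exactly the cost I would like to push down into Hive:

```scala
import org.apache.spark.sql.functions.col

val n = 1024  // partition count; using the same n for every table is the point

val bigP = sqlContext.table("big_input").repartition(n, col("join_key"))
val dims = Seq("t1", "t2", "t3", "t4", "t5")
  .map(t => sqlContext.table(t).repartition(n, col("join_key")))

// Chain the five outer joins. After a using-column join the output keeps
// a single join_key column, so each subsequent join can reuse it.
val joined = dims.foldLeft(bigP) { (acc, d) =>
  acc.join(d, Seq("join_key"), "left_outer")
}
```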