I have the following data sets:
Dataset 1: id, field1
Dataset 2: l_id, r_id
Dataset 3: id, field2
Here are their sizes: Dataset 1: 20G, Dataset 2: 5T, Dataset 3: 20G.
Goal: I would like to join all these datasets on the id fields (l_id with id from Dataset 1, and r_id with id from Dataset 3), so that the final dataset looks like:
l_id r_id field1 field2
My current approach: join Dataset 1 and Dataset 2 (on id and l_id) to produce (l_id, r_id, field1), and then join that result with Dataset 3 (on r_id and id) to produce (l_id, r_id, field1, field2). I am assuming Spark automatically uses a hash partitioner based on the fields being joined. This approach, however, causes one of the executors to run out of disk space, probably due to the amount of shuffling.
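For reference, the two-step join looks roughly like this with the DataFrame API (just a sketch for illustration; the dataset1/dataset2/dataset3 variable names are placeholders, and I actually run the joins through Hive SQL as described in the update below):

// Step 1: join Dataset 2 with Dataset 1 on l_id = id
val step1 = dataset2.join(dataset1, dataset2("l_id") === dataset1("id"))
  .select("l_id", "r_id", "field1")
// Step 2: join the intermediate result with Dataset 3 on r_id = id
val result = step1.join(dataset3, step1("r_id") === dataset3("id"))
  .select("l_id", "r_id", "field1", "field2")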
Can you suggest how I should go about joining these datasets? Is my understanding correct that Spark uses a hash partitioner by default, based on the columns being joined? Or do I have to manually partition the data first and then perform the joins?
Please note that broadcasting Dataset 1/2 isn't an option, as they are too big and might get even bigger in the future. Also, all the datasets are non-key-value RDDs and contain more fields than those listed here, so I am not sure how the default partitioning works or how I can configure a custom partitioner.
Thanks.
Update 1:
I am using Hive SQL to perform all the joins, with spark.sql.shuffle.partitions set to 33000 and the following configuration:
sparkConf.set("spark.akka.frameSize", "500")
sparkConf.set("spark.storage.memoryFraction", "0.2")
sparkConf.set("spark.network.timeout", "1200")
sparkConf.set("spark.yarn.scheduler.heartbeat.interval-ms", "10000")
sparkConf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.driver.maxResultSize", "0")
sparkConf.set("spark.shuffle.consolidateFiles", "true")
I also have control over how all these datasets are generated. None of them seem to have a partitioner set (judging by rdd.partitioner), and I don't see any API in SQLContext that lets me configure a partitioner when creating a DataFrame.
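This is how I checked, and what I imagine manual pre-partitioning on the RDD side would look like (just a sketch; keying by column position and the Long type are assumptions, since the real datasets have more fields):

import org.apache.spark.HashPartitioner

// Check whether a partitioner is set on the underlying RDD (prints None for all three datasets)
println(dataset1.rdd.partitioner)

// Manual pre-partitioning would mean keying by the join column and calling partitionBy,
// e.g. assuming id is the first column of Dataset 1 and is a Long:
val keyed1 = dataset1.rdd.map(row => (row.getLong(0), row))
val prePartitioned1 = keyed1.partitionBy(new HashPartitioner(33000))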
I am using Scala and Spark 1.3.