
I have the following data sets:

Dataset 1:                 Dataset 2:                   Dataset 3:
id  field1                 l_id    r_id                 id field2

Here are their sizes: Dataset1: 20G, Dataset2: 5T, Dataset3: 20G

Goal: I would like to join all these datasets on the id field (l_id with id from Dataset1 and r_id with id from Dataset3), with the final dataset looking like:

l_id     r_id     field1      field2

My current approach: join Dataset1 and Dataset2 (on id and l_id) to produce (l_id, r_id, field1), and then join this with Dataset3 (on r_id and id) to produce (l_id, r_id, field1, field2). I am assuming Spark automatically uses a hash partitioner, looking at the fields being joined. This approach, though, causes one of the executors to run out of disk space, probably due to the amount of shuffling.
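For concreteness, here is a minimal sketch of that two-step join expressed with the Spark 1.3 DataFrame API (the table and variable names are placeholders for illustration, not the actual job):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val ds1 = hiveContext.table("dataset1")   // id, field1, ...
val ds2 = hiveContext.table("dataset2")   // l_id, r_id, ...
val ds3 = hiveContext.table("dataset3")   // id, field2, ...

// Step 1: join Dataset2 with Dataset1 on l_id = id -> (l_id, r_id, field1)
val step1 = ds2.join(ds1, ds2("l_id") === ds1("id"))
  .select(ds2("l_id"), ds2("r_id"), ds1("field1"))

// Step 2: join the result with Dataset3 on r_id = id -> (l_id, r_id, field1, field2)
val result = step1.join(ds3, step1("r_id") === ds3("id"))
  .select(step1("l_id"), step1("r_id"), step1("field1"), ds3("field2"))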

Can you suggest how I can go about joining these datasets? Is my understanding correct that Spark uses a hash partitioner by default, looking at the columns being joined? Or do I have to manually partition the data first and then perform the joins?

Please note that broadcasting Dataset1/2 isn't an option, as they are too big and might get even bigger in the future. Also, all the datasets are non-key-value RDDs and contain more fields than those listed here, so I am not sure how the default partitioning works and how I can configure a custom partitioner.

Thanks.

Update 1:

I am using Hive SQL to perform all the joins, with spark.sql.shuffle.partitions set to 33000 and the following configuration:

sparkConf.set("spark.akka.frameSize", "500")
sparkConf.set("spark.storage.memoryFraction", "0.2")
sparkConf.set("spark.network.timeout", "1200")
sparkConf.set("spark.yarn.scheduler.heartbeat.interval-ms", "10000")
sparkConf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.driver.maxResultSize", "0")
sparkConf.set("spark.shuffle.consolidateFiles", "true")

I also have control over how all these datasets are generated. None of them seems to have a partitioner set (judging by rdd.partitioner), and I don't see any API in SQLContext that lets me configure a partitioner when creating a DataFrame.

I am using Scala and Spark 1.3.

soontobeared
  • What is the configuration of your cluster? – axlpado - Agile Lab Aug 27 '15 at 18:06
  • It's a 100 node cluster with 60G RAM and 745G disk space on each. My job configuration is 20G driver-memory, 20G executor-memory, 2 executor-cores, 120 num-executors, 33000 spark.sql.shuffle.partitions and 2 spark.driver.cores. – soontobeared Aug 27 '15 at 21:11
  • Can you post the information about the shuffle writes and the logs from the executor which is failing? Also, how much space do you have in the shuffle directory? Maybe try tuning the shuffle memoryFraction. – Holden Aug 27 '15 at 21:56
  • Hello, there is about 15T of space in the shuffle directory. However, the job is resulting in about 18T in shuffle writes and 10T in shuffle reads. I am trying to figure out if there's a way to repartition the data or read the bigger dataset in batches (performing the joins and unioning the partial datasets) to reduce the shuffling. – soontobeared Aug 28 '15 at 18:10

1 Answer


The partitioner of your data depends on where the RDD has come from, so you shouldn't need to manually re-partition your data. However, if you do repartition your data so that both sides share the same partitioner, then joining (and cogrouping) will result in a narrow transformation instead of requiring a shuffle as part of the join. Note that in newer versions of Spark (1.2+) the default shuffle is now a sort-based shuffle instead of the hash-based shuffle.
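As a minimal sketch of that idea with raw pair RDDs (the function, element types, and partition count below are illustrative assumptions, not taken from the question):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Pre-partition both sides with the same partitioner so the join becomes a
// narrow dependency and does not reshuffle at join time.
def coPartitionedJoin(
    ds1: RDD[(Long, String)],  // (id, field1)
    ds2: RDD[(Long, Long)]     // (l_id, r_id), keyed by l_id
): RDD[(Long, (String, Long))] = {
  val partitioner = new HashPartitioner(2000) // tune to the data size

  val left  = ds1.partitionBy(partitioner)
  val right = ds2.partitionBy(partitioner)

  // Both inputs share the partitioner, so this join is a narrow transformation.
  left.join(right) // (id, (field1, r_id))
}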

It's difficult to say how to change your joins without the code and logs (and it would perhaps also be useful to know what the distribution of ids looks like).

You can try increasing the number of partitions (both as input and as output) in case there is an issue with unbalanced data. One possibility is that your scratch space is simply too small; you can configure Spark to use a different directory for temporary storage with spark.local.dir. If your objects are Kryo serializable (or if you have the time to make them so), you may also want to look at changing spark.serializer, since a different serializer can take up much less space.

While not directly related to job completion, you may also wish to increase spark.shuffle.memoryFraction and decrease spark.storage.memoryFraction so as to reduce the amount of spilling to disk required during the shuffle.
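A sketch of those knobs set on the SparkConf before the context is created (the values and paths are placeholders to tune, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.local.dir", "/mnt/disk1/spark-tmp,/mnt/disk2/spark-tmp") // larger scratch space; several disks can be listed
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // more compact serialization
  .set("spark.shuffle.memoryFraction", "0.4") // more memory for shuffle aggregation
  .set("spark.storage.memoryFraction", "0.2") // less memory reserved for cached RDDs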

One option, if you had slightly differently structured data, would be to use cogroup, which supports joining many RDDs at the same time, but that requires all of them to be keyed on the same key.
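A hypothetical sketch of a three-way cogroup, assuming the data could be restructured so that all three RDDs are keyed on the same id (which is not quite the case here, since Dataset2 carries two different ids):

import org.apache.spark.rdd.RDD

// Cogroup three pair RDDs that share the same key type in a single pass.
def threeWayCogroup(
    a: RDD[(Long, String)],
    b: RDD[(Long, Long)],
    c: RDD[(Long, String)]
): RDD[(Long, (Iterable[String], Iterable[Long], Iterable[String]))] =
  a.cogroup(b, c)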

Note: this all assumes you are working with raw Spark instead of Spark SQL. For tuning Spark SQL joins take a look at https://spark.apache.org/docs/latest/sql-programming-guide.html (especially consider tuning spark.sql.shuffle.partitions).
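For Spark SQL specifically, the shuffle parallelism can be set on the existing SQLContext/HiveContext; the value below is only an example:

// Example only: controls how many partitions Spark SQL uses for shuffles in joins/aggregations.
sqlContext.setConf("spark.sql.shuffle.partitions", "5000")
// or equivalently through SQL:
sqlContext.sql("SET spark.sql.shuffle.partitions=5000")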

Hope this helps.

Holden
  • Thanks for your response. I am using Hive SQL to perform the joins. I updated my original post with the configuration I am using. – soontobeared Aug 27 '15 at 21:51