I have two RDDs I want to join. One is very large, XL, and the other is regular sized, M. For speed, does it matter which order I join them? For example:

val data = M.join(XL)

vs

val data = XL.join(M)
On 'core' Spark, if you are using join, the order will not matter. But you can optimize the join by using a broadcast variable and doing the join with a map.
val bcSmallData = sc.broadcast(sRDD.collectAsMap())  // collect the small RDD as a Map on the driver and broadcast it
xlRDD.map { case (k, v) => (k, (v, bcSmallData.value.get(k))) }  // map-side lookup against the broadcast map, no shuffle of xlRDD
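As a rough end-to-end sketch of that pattern (the names mPairs, xlPairs, and bcM are made up for illustration; assume sc is your SparkContext):

val mPairs  = sc.parallelize(Seq((1, "a"), (2, "b")))           // small pair RDD
val xlPairs = sc.parallelize(Seq((1, 10), (2, 20), (3, 30)))    // large pair RDD
val bcM     = sc.broadcast(mPairs.collectAsMap())               // ship the small side to every executor once
val joined  = xlPairs.map { case (k, v) => (k, (v, bcM.value.get(k))) }  // per-record lookup instead of a shuffle join
joined.collect()  // Array((1,(10,Some(a))), (2,(20,Some(b))), (3,(30,None)))

Note this behaves like a left outer join of the large side against the small side; keys missing from the small RDD come back as None.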
See this 'Advanced Spark' presentation for a detailed explanation.
Now, if you use Spark SQL, this optimization is done automagically for you. There's a configuration option (spark.sql.autoBroadcastJoinThreshold) that controls the threshold size below which the smaller table is broadcast. The order of the join will not matter; the query optimizer will look at the table sizes.
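For example, a minimal Spark SQL sketch (assuming a SparkSession named spark and two DataFrames mDF and xlDF, which are illustrative names, joined on a column called "key"):

import org.apache.spark.sql.functions.broadcast

// Tables smaller than this many bytes are broadcast automatically; set to -1 to disable
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

val auto   = xlDF.join(mDF, Seq("key"))          // optimizer picks a broadcast join if mDF is under the threshold
val forced = xlDF.join(broadcast(mDF), "key")    // or hint the broadcast explicitly, regardless of the threshold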
According to this answer, it does not matter. I am not sure the other question is the same, since it asks about tables rather than RDDs; the asker may be referring to tables joined in Spark SQL, but that answer is about RDDs.