I have two RDDs I want to join. One is very large, XL, and the other is regular sized, M. For speed, does it matter which order I join them? For example:

val data = M.join(XL)

vs

val data = XL.join(M)
On 'core' Spark, if you are using join, the order will not matter. But you can optimize the join by using a broadcast variable and doing the join with a map.
val bcSmallData = sc.broadcast(sRDD.collectAsMap())  // collect the small RDD as a Map on the driver and broadcast it
xlRDD.map { case (k, v) => (k, (v, bcSmallData.value.get(k))) }  // map-side lookup against the broadcast map, no shuffle of xlRDD
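As a rough end-to-end sketch of that pattern (the names mPairs, xlPairs, and bcM are made up for illustration; assume sc is your SparkContext):

val mPairs  = sc.parallelize(Seq((1, "a"), (2, "b")))           // small pair RDD
val xlPairs = sc.parallelize(Seq((1, 10), (2, 20), (3, 30)))    // large pair RDD
val bcM     = sc.broadcast(mPairs.collectAsMap())               // ship the small side to every executor once
val joined  = xlPairs.map { case (k, v) => (k, (v, bcM.value.get(k))) }  // per-record lookup instead of a shuffle join
joined.collect()  // Array((1,(10,Some(a))), (2,(20,Some(b))), (3,(30,None)))

Note this behaves like a left outer join of the large side against the small side; keys missing from the small RDD come back as None.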
See this 'Advanced Spark' presentation for a detailed explanation.
Now, if you use Spark SQL, this optimization is done automagically for you. There's a configuration option (spark.sql.autoBroadcastJoinThreshold) that controls the threshold size below which the smaller table is broadcast. The order of the join will not matter; the query optimizer will look at the table sizes.
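For example, a minimal Spark SQL sketch (assuming a SparkSession named spark and two DataFrames mDF and xlDF, which are illustrative names, joined on a column called "key"):

import org.apache.spark.sql.functions.broadcast

// Tables smaller than this many bytes are broadcast automatically; set to -1 to disable
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

val auto   = xlDF.join(mDF, Seq("key"))          // optimizer picks a broadcast join if mDF is under the threshold
val forced = xlDF.join(broadcast(mDF), "key")    // or hint the broadcast explicitly, regardless of the threshold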
According to this answer, it does not matter. I am not sure the other question is the same, since it asks about tables rather than RDDs; the asker may be referring to tables joined in Spark SQL, but that answer is about RDDs.