In Spark, when no partitioner is specified, does the ReduceByKey operation repartition the data by hash before starting aggregating it?

Question

If we do not mention any partitioner for a reduceByKey operation, does it perform hashPartitioning internally before the reduction? For example my test code is like:

val rdd = sc.parallelize(Seq((5, 1), (10, 2), (15, 3), (5, 4), (5, 1), (5,3), (5,9), (5,6)), 5)
val newRdd = rdd.reduceByKey((a,b) => (a+b))

Here, does the reduceByKey operation brings all records with same key to the same partition and the perform the reduction (for the above code when no partitioner is mentioned)? Since my use case has skewed data (all having same key), it can cause out of memory error if it brings all records to one partition. So a uniform distribution of the records over all the partitions suits the use case here.

Oli · Accepted Answer · 2018-08-13T14:16:35.143

2

In fact, the great advantage of using reduceByKey instead of groupByKey is that spark will combine keys that are on the same partition before the shuffle (that is before re partitioning anything). Therefore it is very unlikely to get memory issues because of skewed data using reduceByKey.

For more details, you may want to read this post from databricks that compares reduceByKey vs groupByKey. In particular, they say this:

While both of these functions will produce the correct answer, the reduceByKey example works much better on a large dataset. That's because Spark knows it can combine output with a common key on each partition before shuffling the data.

edited Aug 13 '18 at 14:16

answered Aug 13 '18 at 13:53

Oli

9,766
5
25
46

Thanks Oli. Can you look into the other doubt, which is "Does reduceByKey perform hashPartitioning before the reduction?" – Sayantan Ghosh Aug 13 '18 at 13:57
The answer is no. When I say "that spark will combine keys that are on the same partition before the shuffle", it means that keys that are equal will be combined before spark performs the hashPartitioning. I edited the answer to make it clearer ;-) – Oli Aug 13 '18 at 14:15
Thanks Oli for the clarification. – Sayantan Ghosh Aug 14 '18 at 18:10

In Spark, when no partitioner is specified, does the ReduceByKey operation repartition the data by hash before starting aggregating it?

1 Answers1