Actually, the two qualities you are talking about are somewhat unrelated.
For reduceByKey(), the first quality means that elements sharing a key are aggregated with the provided associative reduce function locally on each executor first, and the partial results are then aggregated across executors. This behavior is encapsulated in a boolean parameter called mapSideCombine, which enables the local aggregation when set to true. When it is false, as it is for groupByKey(), every record is shuffled and sent to the executor that owns its key.
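A minimal sketch of the difference, using made-up word-count data:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("combine-demo").setMaster("local[*]"))

// Two partitions, with the key "a" appearing in both.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)), numSlices = 2)

// reduceByKey combines values per key inside each partition first
// (mapSideCombine = true), so only partial sums cross the network.
val reduced = pairs.reduceByKey(_ + _)

// groupByKey shuffles every single record to the executor owning its
// key (mapSideCombine = false) and only then groups the values.
val grouped = pairs.groupByKey().mapValues(_.sum)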
The second quality concerns partitioning and how it is used. Each RDD, by virtue of its definition, contains a list of partitions and (optionally) a partitioner. The method reduceByKey()
is overloaded and actually has a few definitions. For example:
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
This definition of the method lets Spark pick a partitioner for you (via Partitioner.defaultPartitioner): if a parent RDD already has a partitioner, it is reused; otherwise a HashPartitioner is created with spark.default.parallelism partitions or, if that property is not set, with the largest number of partitions among the parent RDDs.
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
This definition of the method uses a HashPartitioner to route records to their corresponding partitions, and the resulting RDD will have numPartitions partitions.
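Reusing pairs from the sketch above, these two overloads look like this in use:

// Spark resolves the partitioner itself (see Partitioner.defaultPartitioner).
val defaultReduced = pairs.reduceByKey(_ + _)

// A HashPartitioner with exactly 8 partitions is used for the shuffle.
val eightPartitions = pairs.reduceByKey(_ + _, numPartitions = 8)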
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
Finally, this is the definition the other two delegate to; it takes a generic (possibly custom) Partitioner, which determines both how keys are assigned to partitions and how many partitions the result will have.
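For example, a custom partitioner (the HotKeyPartitioner below is hypothetical, made up for illustration) can be passed straight into the call, so the shuffle for the reduce uses it directly:

import org.apache.spark.Partitioner

// Routes one known hot key to a dedicated partition and hashes all
// other keys across the remaining partitions.
class HotKeyPartitioner(numParts: Int, hotKey: String) extends Partitioner {
  require(numParts >= 2)

  override def numPartitions: Int = numParts

  override def getPartition(key: Any): Int =
    if (key == hotKey) 0  // the hot key gets partition 0 to itself
    else {
      val m = key.hashCode % (numParts - 1)
      1 + (if (m < 0) m + (numParts - 1) else m)  // non-negative modulo
    }
}

val hotKeyCounts = pairs.reduceByKey(new HotKeyPartitioner(4, "a"), _ + _)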
The point of all this is that you can encode your desired partitioning logic directly within the reduceByKey() call itself. If your intention was to avoid shuffle overhead by pre-partitioning, it doesn't really make sense either: the pre-partitioning step is itself a shuffle, so you just pay the same cost one step earlier.
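To make that concrete (a minimal sketch, reusing pairs from above):

import org.apache.spark.HashPartitioner

// partitionBy is itself a shuffle, so the data crosses the network here.
val prePartitioned = pairs.partitionBy(new HashPartitioner(4))

// Because prePartitioned already carries this exact partitioner, the
// reduce below skips a second shuffle, but the first one was still paid.
val counts = prePartitioned.reduceByKey(new HashPartitioner(4), _ + _)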