While learning Hadoop MapReduce, I came across how to create a custom Partitioner class. I understand that we need to define the abstract getPartition method in our class. This method is supposed to return the Partition number (an integer) for the current key-value pair.

Now, the number of partitions will be equal to the number of reduce tasks for the job. What if in a custom partitioner, one writes some logic to select the partition based on the 'value' and not the 'key'? With my understanding, this could mean that records having the same key (but different values) might be processed by different reduce tasks, which is not what we are guaranteed by MapReduce. Isn't this an anomaly? And why do we even need the 'value' argument in getPartition(key, value, numPartitions) method? Please correct my understanding if incorrect.

Ankit Khettry

2 Answers

A partition can be chosen from either the intermediate key or the intermediate value (the intermediate pairs are the mapper output before it is spilled to disk). If you partition on the value, records with the same key can end up in different partitions, and therefore at different reduce tasks.
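To make that concrete, here is a minimal plain-Java sketch. It is not the Hadoop API: `ValuePartitionSketch` and its value-modulo rule are hypothetical stand-ins for a custom `getPartition(key, value, numPartitions)` that looks only at the value.

```java
import java.util.Set;
import java.util.TreeSet;

// Plain-Java stand-in for a value-based getPartition; not a Hadoop class.
class ValuePartitionSketch {

    // Hypothetical rule: choose the partition from the value and ignore the key.
    static int getPartition(String key, int value, int numPartitions) {
        return Math.floorMod(value, numPartitions);
    }

    // Collects the distinct partitions that records sharing one key are sent to.
    static Set<Integer> partitionsForKey(String key, int[] values, int numPartitions) {
        Set<Integer> seen = new TreeSet<>();
        for (int v : values) {
            seen.add(getPartition(key, v, numPartitions));
        }
        return seen;
    }

    public static void main(String[] args) {
        // Three records with the same key "apple" and values 1, 2, 3, two partitions.
        System.out.println(partitionsForKey("apple", new int[] {1, 2, 3}, 2)); // prints [0, 1]
    }
}
```

With two partitions, the three "apple" records are split across both of them, so no single reduce task sees every value for that key.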

Shravanya

The partitioner operates on the intermediate key-value pairs, that is, the map output before it is spilled to disk. Because it works on the map output, it is typed with the same Writable classes declared for the map output key and value, which is why getPartition receives both. The main job of a partitioner is to spread records across reducers so that no single reducer receives almost all the data, and the key alone is normally enough to compute the partition index. As per Hadoop: The Definitive Guide, the default partitioner ignores the value altogether.
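As a rough illustration of a key-only partitioner, the sketch below reuses the arithmetic of Hadoop's default HashPartitioner (mask the sign bit of the key's hash, then take the remainder). The class name `KeyPartitionSketch` is hypothetical, and the code is plain Java rather than a real `Partitioner` subclass, so it runs without Hadoop on the classpath.

```java
// Plain-Java stand-in for a key-based getPartition; not a Hadoop class.
class KeyPartitionSketch {

    // Only the key is used: the value parameter is accepted but never read.
    static int getPartition(Object key, Object value, int numPartitions) {
        // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int first = getPartition("apple", 1, 4);
        int second = getPartition("apple", 999, 4);
        // Same key, different values: both records go to the same partition.
        System.out.println(first == second); // prints true
    }
}
```

Because the value never enters the computation, every record with a given key lands in the same partition, which is what lets a single reduce task see all values for that key.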

Vignesh I
  • Thanks for the response. I came across the following link, where the partition is chosen based on the value. I am curious to know what happens in this case. Can two different partitions receive records having the same key? [http://hadooptutorial.wikispaces.com/Custom+partitioner](http://hadooptutorial.wikispaces.com/Custom+partitioner) – Ankit Khettry Sep 02 '15 at 12:48