
As we know, during the shuffle phase of Hadoop, each reducer reads data from all of the mappers' output (intermediate data).

Now, we also know that by default hash partitioning is used to assign map output keys to reducers.
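For reference, the default HashPartitioner in Hadoop simply maps a key's hash code onto a reducer index, roughly like this simplified sketch:

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Simplified view of Hadoop's default HashPartitioner:
// every key is routed to a reducer based on its hash code.
public class HashPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // Mask off the sign bit so the result is non-negative, then take the modulo
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```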

My question is: how do we implement a different partitioning algorithm, e.g. a locality-aware one?

Ashrith
Cloud

1 Answer


In short, you should not do it.

First, you have no control over where the mappers and reducers are executed on the cluster, so even if the complete output of a single mapper goes to a single reducer, there is a high probability that they will run on different hosts and the data will still be transferred over the network.

Second, to make a reducer process the whole output of a mapper, you first have to make that mapper process the right part of the input, which means you would have to preprocess the data by partitioning it and then run a single mapper and a single reducer for each partition. This preprocessing would itself consume significant resources, so it is mostly pointless.

And finally, why do you need it? The main concept of MapReduce is manipulating key-value pairs, and a reducer in general aggregates the list of values emitted by the mappers for the same key. This is why hash partitioning is used: it distributes N keys among K reducers. Needing a different type of partitioner is a rare case. If you really need data locality, you might prefer to work with an MPP database rather than Hadoop, for example.

If you really need a custom partitioner, here's an example of how it can be implemented: http://hadooptutorial.wikispaces.com/Custom+partitioner. Nothing special: you just return a reducer number based on the key, the value, and the number of reducers. Using the hash code of the host name modulo (%) the number of reducers will make the whole output of a single mapper go to a single reducer. You might also use the process PID % the number of reducers. But before doing this, check whether you really need this behavior. A sketch of such a partitioner is shown below.
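As a rough sketch (not taken from the linked tutorial), a host-based partitioner could look like the following, assuming the mapper emits the source host name as the key; the class name HostPartitioner and the Text key/value types are illustrative choices, not something prescribed above:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical locality-style partitioner: all records keyed by the same
// host name end up on the same reducer, so a mapper's output (if keyed by
// the host it ran on) is processed together.
public class HostPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numReduceTasks) {
    // Non-negative hash of the host name, modulo the number of reducers
    return (key.toString().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```

It would be wired into the job with job.setPartitionerClass(HostPartitioner.class); but, as noted above, first check whether you really need this behavior.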

0x0FFF