Map Reduce flow in Hadoop

Question

I'm learning Hadoop using the book Hadoop in Practice, and while reading chapter 1 I came across this diagram:

[diagram from the book; image not available]

From the Hadoop docs (http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapred/Reducer.html):

1. Shuffle

Reducer is input the grouped output of a Mapper. In the phase the framework, for each Reducer, fetches the relevant partition of the output of all the Mappers, via HTTP.

2. Sort

The framework groups Reducer inputs by keys (since different Mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.

While I understand that shuffling and sorting happen at the same time, it's not clear to me how the framework decides which reducer receives which mapper output. From the docs, it seems that each reducer has a way to know which map output to collect, but I can't understand how.

So my question is: given the mappers' output above, is the final result always the same for each reducer? If so, what are the steps to achieve this result?

Thanks for any clarifications!

possible duplicate of [Hadoop - How does reducer gets it data?](http://stackoverflow.com/questions/10527271/hadoop-how-does-reducer-gets-it-data) - see also: http://stackoverflow.com/questions/20757318/what-two-different-keys-go-to-the-same-reducer-by-the-default-hash-partitioner-i and http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/Partitioner.html — Brian Roach, Jan 04 '14 at 03:00

score 1 · Accepted Answer · answered Jan 04 '14 at 03:07

It is the Partitioner that decides how to distribute the output of mappers to different reducers.

Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent for reduction.
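
To make the quoted description concrete, here is a minimal sketch of hash partitioning in plain Java. The formula in `partitionFor` is the one used by Hadoop's default `HashPartitioner`. The class name `PartitionSketch`, the helper `assign`, and the use of plain `String` keys in place of Hadoop `Writable` keys (whose `hashCode()` values differ) are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class, not part of Hadoop.
class PartitionSketch {

    // Same formula as Hadoop's default HashPartitioner: mask off the sign bit
    // so the value is non-negative, then take it modulo the reducer count.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Groups keys by the partition (that is, the reducer index) they map to.
    static Map<Integer, List<String>> assign(List<String> keys, int numReduceTasks) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String key : keys) {
            byPartition
                .computeIfAbsent(partitionFor(key, numReduceTasks), p -> new ArrayList<>())
                .add(key);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        // Keys as they might be emitted by two different mappers.
        List<String> mapper1 = Arrays.asList("a", "b", "c");
        List<String> mapper2 = Arrays.asList("c", "a");
        System.out.println("mapper 1: " + assign(mapper1, 3));
        System.out.println("mapper 2: " + assign(mapper2, 3));
    }
}
```

Because the partition depends only on the key and the number of reduce tasks, every mapper sends a given key to the same reducer. For a fixed input and reducer count, each reducer therefore receives the same set of keys on every run.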

keelar

Thanks, this made it a little bit clearer, I'll need to meditate to understand this! – Fernando Jan 04 '14 at 03:11

You can also set the number of reducers in your MapReduce program with `job.setNumReduceTasks(No_of_reducers_you_want)`. Note that the argument is an int, so you can explicitly define the number of reducers based on how many partitions you want. – Jijo Jan 18 '14 at 05:48
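
As a rough illustration of why the reducer count matters, the sketch below counts how many records fall into each partition for different reducer counts. It runs without Hadoop: the class `ReducerCountSketch` and the method `partitionSizes` are hypothetical names, `String` keys stand in for Hadoop key types, and the loop variable `n` plays the role of the value you would pass to `job.setNumReduceTasks`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class, not part of Hadoop.
class ReducerCountSketch {

    // Counts how many records land in each partition for a given reducer count,
    // using the same formula as Hadoop's default HashPartitioner.
    static int[] partitionSizes(List<String> keys, int numReduceTasks) {
        int[] sizes = new int[numReduceTasks];
        for (String key : keys) {
            sizes[(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks]++;
        }
        return sizes;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c", "a");
        // In a real job, n would be the argument to job.setNumReduceTasks(n).
        for (int n = 1; n <= 3; n++) {
            System.out.println(n + " reducer(s): " + Arrays.toString(partitionSizes(keys, n)));
        }
    }
}
```

Changing the reducer count changes which partition a key falls into, but for any single count the assignment stays fixed, and both records with key "a" always end up in the same partition.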
