I'm learning Hadoop using the book Hadoop in Practice
, and while reading chapter 1 i came across this diagram:
From the Hadoop docs:(http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapred/Reducer.html)
1.Shuffle
Reducer is input the grouped output of a Mapper. In the phase the framework, for each Reducer, fetches the relevant partition of the output of all the Mappers, via HTTP.
2.Sort
The framework groups Reducer inputs by keys (since different Mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.
While i understand that shuffle
and sorting
happens at the same time, it's not clear to me how the framework decides which reducer
receives which mapper
output. From the docs, it seems that each reducer
has a way to know which map
output to collect, but i can't understand how.
So my question is, given the mappers output above, the final result is always the same for each reducer? If so, what are the steps to achieve this result?
Thanks for any clarifications!