
I'm confused about part of Hadoop's implementation.

I notice that when I run a Hadoop MapReduce job with multiple mappers and reducers, I get many part-xxxxx output files, and any given key only ever appears in one of them.

So I'm wondering: how does MapReduce guarantee that each key goes to exactly one output file?

Thanks in advance.

Zz'Rot

2 Answers


The shuffle step in the MapReduce process is responsible for ensuring that all records with the same key end up in the same reduce task. See this Yahoo tutorial for a description of the MapReduce data flow. The section called Partition & Shuffle states that

Each map task may emit (key, value) pairs to any partition; all values for the same key are always reduced together regardless of which mapper is its origin.
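The guarantee comes from the partitioner: the partition number is computed from the key alone, so every record with a given key is routed to the same reduce task no matter which mapper emitted it. Hadoop's default `HashPartitioner` uses `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. A minimal standalone sketch of that logic (the class name `PartitionDemo` and the string keys are illustrative, not part of the Hadoop API):

```java
// Sketch of the default HashPartitioner logic
// (org.apache.hadoop.mapreduce.lib.partition.HashPartitioner).
public class PartitionDemo {

    // Same formula Hadoop uses; masking with Integer.MAX_VALUE
    // keeps the result non-negative even for negative hash codes.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;
        // The partition depends only on the key, so two mappers emitting
        // the same key always hit the same reduce task, and therefore
        // the same part-xxxxx file.
        System.out.println(
            partition("apple", reducers) == partition("apple", reducers)); // true
        for (String k : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(k + " -> reducer " + partition(k, reducers));
        }
    }
}
```

Because `hashCode()` is deterministic for a given key, the mapping from key to reducer is fixed for the lifetime of the job; you can also plug in a custom `Partitioner` if you need a different routing scheme.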

Alex A.

Shuffle

Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.

Sort

The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage.

The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.
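Because each reducer's input arrives already sorted by key, grouping all values for a key is a single linear pass. A small illustrative sketch of that grouping step (the class name `GroupDemo` and the sample pairs are hypothetical; this is not the Hadoop API, just the idea behind how `reduce()` receives a key with an `Iterable` of its values):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupDemo {

    // Linear-pass grouping of (key, value) pairs that are already sorted
    // by key, mirroring how the framework hands (key, values) to reduce().
    static Map<String, List<String>> group(String[][] sortedPairs) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] kv : sortedPairs) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Hypothetical sorted map output fetched by one reduce task.
        String[][] fetched = {{"apple", "1"}, {"apple", "1"}, {"banana", "1"}};
        System.out.println(group(fetched)); // {apple=[1, 1], banana=[1]}
    }
}
```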

I got this from the official Hadoop MapReduce tutorial:

https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html

Have a look at it; I hope it helps.

backtrack