I have read that the combiner reduces the network traffic between mappers and reducers. It acts as a kind of semi-reducer that summarises the map output before it is sent across the network to the reducers.
However, I do not understand the example given here. In the diagram they show, the combiner merges keys coming from different mappers, which is the part that confuses me. I thought a combiner only reduces the output of a single mapper, so the combining itself involves no network traffic, and the combiner's output is then sent across the network for a second, final level of aggregation.
Input text:
What do you mean by Object
What do you know about Java
What is Java Virtual Machine
How Java enabled High Performance
Assuming each line goes to a separate mapper
Map phase output:
<What,1> <do,1> <you,1> <mean,1> <by,1> <Object,1>
<What,1> <do,1> <you,1> <know,1> <about,1> <Java,1>
<What,1> <is,1> <Java,1> <Virtual,1> <Machine,1>
<How,1> <Java,1> <enabled,1> <High,1> <Performance,1>
Claimed Combiner phase output:
<What,1,1,1> <do,1,1> <you,1,1> <mean,1> <by,1> <Object,1>
<know,1> <about,1> <Java,1,1,1>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>
I do not understand how the combiner could produce <What,1,1,1>, since "What" is emitted by different mappers. I would have expected the output of the combiner to be a semi-reduced summary of each line, like so:
<What,1> <do,1> <you,1> <mean,1> <by,1> <Object,1>
<What,1> <do,1> <you,1> <know,1> <about,1> <Java,1>
<What,1> <is,1> <Java,1> <Virtual,1> <Machine,1>
<How,1> <Java,1> <enabled,1> <High,1> <Performance,1>
This is the same as the mapper output, but it would have been different if any word were repeated within a line.
Can you help me understand whether my understanding of a combiner is right? If not, what am I misunderstanding?
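To make my expectation concrete, here is a minimal sketch of my mental model in plain Python. It is not Hadoop code: the function names map_phase, combine and reduce_phase are my own, and it assumes one mapper per input line, with the combiner seeing only the pairs emitted by its own mapper.

```python
from collections import Counter

# The four input lines from the example; assumption: one mapper per line.
lines = [
    "What do you mean by Object",
    "What do you know about Java",
    "What is Java Virtual Machine",
    "How Java enabled High Performance",
]

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Sum counts per key using only the pairs from a single mapper."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

def reduce_phase(all_pairs):
    """Sum counts per key across the combined output of all mappers."""
    totals = Counter()
    for word, count in all_pairs:
        totals[word] += count
    return dict(totals)

# Each combiner runs on its own mapper's output, before anything is shuffled.
combined_per_mapper = [combine(map_phase(line)) for line in lines]

# Only the combined pairs cross the network to the reducer (simulated here).
shuffled = [pair for output in combined_per_mapper for pair in output]
final_counts = reduce_phase(shuffled)
```

Under this model, combined_per_mapper[0] is identical to the first mapper's output because no word repeats within that line, and only reduce_phase sees all three occurrences of "What". A line such as "Java Java is Java" would be the case where the combiner actually shrinks the data, to ("Java", 3) and ("is", 1).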