I was trying to analyse the default MapReduce job, the one that doesn't define a mapper or a reducer, i.e. the one that uses IdentityMapper and IdentityReducer. To make things clear, I wrote my own identity reducer:
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public static class MyIdentityReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Pass every value straight through to the output, paired with its key.
        while (values.hasNext()) {
            Text value = values.next();
            output.collect(key, value);
        }
    }
}
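For reference, the driver I ran this with looked roughly like the sketch below. It is not my exact code: I'm assuming the old mapred API throughout, KeyValueTextInputFormat with the key/value separator set to a single space (since both the key and the value are Text), and NameAddressJob is just a placeholder name for the enclosing class that also holds MyIdentityReducer.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

public class NameAddressJob {

    // MyIdentityReducer from above sits here as a static nested class.

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(NameAddressJob.class);
        conf.setJobName("name-address-identity");

        // Split each input line into a Text key and a Text value at the first
        // separator (assuming a single space between name and address).
        conf.setInputFormat(KeyValueTextInputFormat.class);
        conf.set("key.value.separator.in.input.line", " ");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        // No setMapperClass() call, so the default IdentityMapper is used.
        conf.setReducerClass(MyIdentityReducer.class);

        FileInputFormat.setInputPaths(conf, new Path("NameAddress.txt"));
        FileOutputFormat.setOutputPath(conf, new Path("NameAddress"));

        JobClient.runJob(conf);
    }
}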
My input file was:
$ hadoop fs -cat NameAddress.txt
Dravid Banglore
Sachin Mumbai
Dhoni Ranchi
Dravid Jaipur
Dhoni Chennai
Sehwag Delhi
Gambhir Delhi
Gambhir Calcutta
I was expecting:
Dravid Jaipur
Dhoni Chennai
Gambhir Calcutta
Sachin Mumbai
Sehwag Delhi
I got:
$ hadoop fs -cat NameAddress/part-00000
Dhoni Ranchi
Dhoni Chennai
Dravid Banglore
Dravid Jaipur
Gambhir Delhi
Gambhir Calcutta
Sachin Mumbai
Sehwag Delhi
I was of the opinion that, since any aggregation is done by the programmer in the while loop of the reducer and then written to the OutputCollector, the keys passed to the OutputCollector would always be unique, so if I don't aggregate, the value collected last for a key would override the values collected before it. Clearly that's not the case. Could someone please give me better insight into the OutputCollector: how it works and how it handles all the keys? I see many implementations of OutputCollector in the Hadoop source code. Can I write my own OutputCollector that does what I am expecting?
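In other words, to get the output I expected I would apparently have to keep only the last value per key myself, something like the sketch below (LastValueReducer is just a name I made up to illustrate my expectation, not something I actually ran):

public static class LastValueReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        Text last = null;
        while (values.hasNext()) {
            // Copy the value: Hadoop reuses the same Text instance on each next().
            last = new Text(values.next());
        }
        if (last != null) {
            output.collect(key, last);  // exactly one record per key
        }
    }
}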