
I was trying to analyse the default MapReduce job, i.e. one that doesn't define a mapper or a reducer and therefore uses IdentityMapper & IdentityReducer. To make myself clear, I wrote my own identity reducer:

// Imports needed by the reducer (the class itself is nested inside the job class)
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public static class MyIdentityReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Pass every value for this key straight through to the output
        while (values.hasNext()) {
            Text value = values.next();
            output.collect(key, value);
        }
    }
}

My input file was:

$ hadoop fs -cat NameAddress.txt
Dravid Banglore
Sachin Mumbai
Dhoni Ranchi
Dravid Jaipur
Dhoni Chennai
Sehwag Delhi
Gambhir Delhi
Gambhir Calcutta

I was expecting:
Dravid Jaipur
Dhoni Chennai
Gambhir Calcutta
Sachin Mumbai
Sehwag Delhi

I got:
$ hadoop fs -cat NameAddress/part-00000
Dhoni   Ranchi
Dhoni   Chennai
Dravid  Banglore
Dravid  Jaipur
Gambhir Delhi
Gambhir Calcutta
Sachin  Mumbai
Sehwag  Delhi

I was under the impression that, since aggregation is done by the programmer in the while loop of the reducer and then written to the OutputCollector, the keys passed to the OutputCollector would always be unique, and that if I don't aggregate, the last value written for a key would override the previous one. Clearly that's not the case. Could someone please give me better insight into the OutputCollector: how it works and how it handles all the keys? I see many implementations of OutputCollector in the Hadoop source code. Can I write my own OutputCollector that does what I am expecting?

– S Kr
  • With an identity mapper and identity reducer, and I'm assuming the default input format (TextInputFormat), your above reducer should fail, as TextInputFormat outputs (LongWritable, Text) pairs. You should see the output in the same order as the input (assuming you are of course using identity mappers, reducers and TextInputFormat) – Chris White Oct 06 '12 at 20:31
  • @Chris-White Yes, I added these to MyJob job.set("key.value.separator.in.input.line", " "); job.setInputFormat(KeyValueTextInputFormat.class); job.setOutputFormat(TextOutputFormat.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(Text.class); – S Kr Oct 06 '12 at 20:33
  • There isn't a restriction on the reducer's output, so it isn't obliged to produce records with unique keys. Therefore the OutputCollector doesn't have to check keys; think of it as a version of System.out.println. – rystsov Oct 07 '12 at 05:45
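
To make the job setup quoted in the second comment concrete, here is a minimal driver sketch for the old org.apache.hadoop.mapred API. The class name MyJob and the input/output paths are assumptions based on the question's shell output, MyIdentityReducer is the nested reducer class shown above, and the mapper is left as the default IdentityMapper:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class MyJob {
    public static void main(String[] args) throws IOException {
        JobConf job = new JobConf(MyJob.class);

        // Split each input line on the first space: "Dravid Banglore" -> key=Dravid, value=Banglore
        job.set("key.value.separator.in.input.line", " ");
        job.setInputFormat(KeyValueTextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // No mapper is set, so the default IdentityMapper passes records through unchanged
        job.setReducerClass(MyIdentityReducer.class);

        FileInputFormat.setInputPaths(job, new Path("NameAddress.txt"));
        FileOutputFormat.setOutputPath(job, new Path("NameAddress"));

        JobClient.runJob(job);
    }
}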

1 Answer


Each call to the reducer receives a unique key together with an iterator over all of the values associated with that key. What you're doing is iterating over all of the values passed in and writing out each one.

So even though there are fewer reduce calls than input records in your case, you still end up writing all of the values out. For your input the framework calls reduce once for Dhoni with {Ranchi, Chennai}, once for Dravid with {Banglore, Jaipur}, and so on; the while loop emits every element, which is why all eight records appear in part-00000.

– Chris Gerken
  • I corrected my question; the reducer key is always unique and comes with a list of values. What I want to know is: when all of these are written as key/value pairs to the OutputCollector, doesn't the OutputCollector check for uniqueness? Is there an OutputCollector that checks for that, and how can I select a particular OutputCollector for my MapReduce job? – S Kr Oct 06 '12 at 20:30
  • If you want uniqueness you'll need to implement that in your reducer - whether you keep the first / last / min / max etc. is up to you – Chris White Oct 07 '12 at 16:30
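
As that last comment suggests, the deduplication has to live in the reducer itself. Here is a minimal sketch (hypothetical class name LastValueReducer, same old mapred API as the question) that keeps only one value per key and would produce output shaped like what the question expected. Note that without a secondary sort Hadoop makes no guarantee about the order of values within a key, so which value counts as "last" is not deterministic:

public static class LastValueReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Remember only the last value the iterator delivers for this key
        Text last = null;
        while (values.hasNext()) {
            last = new Text(values.next());  // copy, because Hadoop reuses the value object
        }
        if (last != null) {
            output.collect(key, last);       // exactly one record per distinct key
        }
    }
}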