
I ran a simple wordcount MapReduce example, adding a combiner with a small change to the combiner's output. The output of the combiner is not merged by the reducer. The scenario is as follows:

Test: Map -> Combiner -> Reducer

In the combiner I added two extra lines to output the word "different" with a count of 1, but the reducer is not summing the "different" word counts. Output pasted below.

Text t = new Text("different"); // Added my own output

context.write(t, new IntWritable(1)); // Added my own output

public class wordcountcombiner extends Reducer<Text, IntWritable, Text, IntWritable>{

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
  {
    int sum = 0;
    for (IntWritable val : values)
    {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum));
    Text t = new Text("different"); // Added my own output
    context.write(t, new IntWritable(1)); // Added my own output
  }
}

Input:

I ran a simple wordcount MapReduce example adding combiner with a small change in combiner output, The output of combiner is not merged by reducer. scenario is as follows In combiner I added two extra lines to out put a word different and count 1, reducer is not suming the "different" word count. output pasted below.

Output:

"different" 1
different   1
different   1
I           2
different   1
In          1
different   1
MapReduce   1
different   1
The         1
different   1
...

How can this happen?

Full code: I ran the wordcount program with a combiner and, just for fun, I tweaked it in the combiner, which is how I ran into this issue. I have three separate classes for the mapper, combiner, and reducer.

Driver:

public class WordCount {

  public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

    Job job = Job.getInstance(new Configuration());
    job.setJarByClass(wordcountmapper.class);
    job.setJobName("Word Count");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(wordcountmapper.class);
    job.setCombinerClass(wordcountcombiner.class);
    job.setReducerClass(wordcountreducer.class);
    job.getConfiguration().set("fs.file.impl", "com.conga.services.hadoop.patch.HADOOP_7682.WinLocalFileSystem");       

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true)? 0 : 1);

  }

}

Mapper:

public class wordcountmapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private Text word = new Text();
  IntWritable one = new IntWritable(1);
  @Override
  public void map(LongWritable key, Text value, Context context) 
        throws IOException, InterruptedException 
  {
    String line = value.toString();
    StringTokenizer token = new StringTokenizer(line);
    while (token.hasMoreTokens())
    {
        word.set(token.nextToken());
        context.write(word, one);
    }
  }
}

Combiner:

public class wordcountcombiner extends Reducer<Text, IntWritable, Text, IntWritable>{

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
  {
    int sum = 0;
    for (IntWritable val : values)
    {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum));
    Text t = new Text("different");
    context.write(t, new IntWritable(1));
  }
}

Reducer:

public class wordcountreducer extends Reducer<Text, IntWritable, Text, IntWritable>{

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
  {
    int sum = 0;
    for (IntWritable val : values)
    {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
} 
vefthym
mlreddy
    don't use the same class for combiner and reducer... In the reducer remove the last two commands – vefthym Aug 06 '14 at 12:55
  • used separate classes for combiner and reducer. tweak was made in combiner to check final reducer output. – mlreddy Aug 06 '14 at 16:33
  • It doesn't make sense to me, if you run this code. Did you perhaps run an older jar? Also, why do you have this line: `job.setJarByClass(wordcountmapper.class);`? Shouldn't it be `job.setJarByClass(WordCount.class);`? – vefthym Aug 07 '14 at 07:08

3 Answers

The output is normal because you have two lines doing the wrong thing. Why do you have this code:

Text t = new Text("different"); // Added my own output
context.write(t, new IntWritable(1)); // Added my own output

In your reducer you compute the sum, and then you are also adding "different 1" to the output.

Ko2r
  • combiner and reducer classes are different; the above code is for the combiner, and the reducer is below: public class wordcountreducer extends Reducer{ @Override public void reduce(Text key, Iterable values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); } } – mlreddy Aug 06 '14 at 16:22

You are writing a new "different 1" into the final output of the job in the reduce function, without doing any kind of aggregation. The reduce function is called once per key: as you can see from the method signature, it takes as arguments a key and the list of values for that key, which means it is called once for each distinct key.

Since the keys are words, and in each call of reduce you write "different 1" to the output, you get one such record for each distinct word in the input data.
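The effect described here can be reproduced in plain Java without Hadoop. The class and method names below are hypothetical; the sketch only simulates a reduce pass that, like the tweaked combiner in the question, emits one extra ("different", 1) record per invocation, so the number of extra records equals the number of distinct keys.

```java
import java.util.*;

public class ExtraWritePerKeyDemo {
    // Simulates a reduce pass that sums each key's values and, like the
    // combiner in the question, also writes one extra ("different", 1)
    // record on every invocation.
    static List<Map.Entry<String, Integer>> reduceAll(Map<String, List<Integer>> grouped) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            out.add(Map.entry(e.getKey(), sum));
            out.add(Map.entry("different", 1)); // the extra write from the question
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("I", List.of(1, 1));
        grouped.put("In", List.of(1));
        grouped.put("MapReduce", List.of(1));
        List<Map.Entry<String, Integer>> out = reduceAll(grouped);
        // 3 distinct keys -> 3 extra ("different", 1) records
        long extras = out.stream().filter(e -> e.getKey().equals("different")).count();
        System.out.println(extras); // prints 3
    }
}
```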

Balduz
  • combiner and reducer classes are different; the above code is for the combiner, and the reducer is below: public class wordcountreducer extends Reducer{ @Override public void reduce(Text key, Iterable values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); } } – mlreddy Aug 06 '14 at 16:23
  • Ok, the problem is that the combiner is called after the map phase, and also after the reduce phase. – Balduz Aug 06 '14 at 19:38
  • @Balduz I don't think that this is possible. Why do you believe so? – vefthym Aug 07 '14 at 07:03
  • 1
    Because that is how combiners work. See the answer: http://stackoverflow.com/questions/14126964/two-equal-combine-keys-do-not-get-to-the-same-reducer – Balduz Aug 07 '14 at 07:08

Hadoop requires that the reduce method of a combiner writes only the same key that it receives as input. This is required because Hadoop sorts the keys only before the combiner is called; it does not re-sort them after the combiner has run. In your program, the combiner's reduce method writes the key "different" in addition to the key it received as input. The key "different" therefore appears at different positions in the sequence of keys, and those occurrences are not merged before they are passed to the reducer.

For example:

Assume the sorted list of keys output by the mapper is: "alpha", "beta", "gamma"

Your combiner is then called three times (once for "alpha", once for "beta", once for "gamma") and produces keys "alpha", "different", then keys "beta", "different", then keys "gamma", "different".

The "sorted" (but actually not sorted) list of keys after the combiner has executed is then:

"alpha", "different", "beta", "different", "gamma", "different"

This list does not get sorted again, so the different occurrences of "different" do not get merged.

The reducer is then called separately six times, and the key "different" appears 3 times in the output of the reducer.
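The walk-through above can be simulated in plain Java (this is a hypothetical sketch, not Hadoop itself): if equal keys are only merged when they appear consecutively in an already-sorted stream, which is effectively what the reduce side assumes, then the out-of-order "different" keys stay in separate groups.

```java
import java.util.*;

public class UnsortedCombinerOutputDemo {
    // Groups only runs of consecutive equal keys, mimicking a merge over a
    // stream that is assumed to be already sorted.
    static List<String> groupConsecutive(List<String> keys) {
        List<String> groups = new ArrayList<>();
        for (String k : keys) {
            if (groups.isEmpty() || !groups.get(groups.size() - 1).equals(k)) {
                groups.add(k);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Combiner output from the example: the injected "different" keys
        // break the sort order and are never re-sorted afterwards.
        List<String> combinerOutput = List.of(
            "alpha", "different", "beta", "different", "gamma", "different");
        List<String> groups = groupConsecutive(combinerOutput);
        System.out.println(groups.size()); // prints 6: six separate reduce groups
        // "different" survives as three separate groups instead of one
        long differentGroups = groups.stream().filter("different"::equals).count();
        System.out.println(differentGroups); // prints 3
    }
}
```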

Joe