
From the Hadoop tutorial website (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Source_Code) on how to implement word count using a MapReduce approach, I understand how it works and that the output will be all words with their frequency.

What I want to do is have the output be only the highest-frequency word from my input file.

Example: Jim Jim Jim Jim Tom Dane

I want the output just to be

Jim 4

The current output from WordCount is each word and its frequency. Has anyone edited WordCount so that it prints just the highest-frequency word and its frequency?

Does anyone have any tips on how to achieve this?

How would I write another MapReduce job that finds the highest-frequency word from the output of WordCount?

Or is there another way?

Any help would be much appreciated.

Thank you!

WordCount.java:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
mike_tech123
  • possible duplicate of [Top N values by Hadoop Map Reduce code](http://stackoverflow.com/questions/20583211/top-n-values-by-hadoop-map-reduce-code) – vefthym Mar 06 '15 at 08:11

3 Answers


A possible way is to set the number of reducers to 1 (`job.setNumReduceTasks(1)` in the driver). Then make the reducer remember the word with the highest frequency and write it to the output in `cleanup()`. Like this:

public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {

    private Text tmpWord = new Text("");
    private int tmpFrequency = 0;

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      if (sum > tmpFrequency) {
         tmpFrequency = sum;
         // copy the contents rather than the reference: Hadoop reuses the
         // key object between reduce() calls, so "tmpWord = key" would
         // leave the *last* key here by the time cleanup() runs
         tmpWord.set(key);
      }
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        // write only the word with the highest frequency
        context.write(tmpWord, new IntWritable(tmpFrequency));
    }
}
Matthias Kricke

You won't be able to do it in one step: the reduce phase is performed independently for every key, so synchronization between reducers is not possible. The solution is to run a new MapReduce job that aggregates the output of your original WordCount job under one key and just selects the max. GL!
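A plain-Java sketch of what that second job's logic amounts to (no Hadoop dependencies; the class and method names here are illustrative, and it assumes WordCount's default tab-separated output lines). The mapper would emit every line under one constant key, so the single reducer sees all pairs and keeps only the maximum:

```java
import java.util.Arrays;
import java.util.List;

public class MaxWordFinder {
    // Simulates the second job's single reducer: every (word, count) pair
    // arrives under one constant key, and only the maximum is emitted.
    static String findMax(List<String> wordCountLines) {
        String bestWord = null;
        int bestCount = -1;
        for (String line : wordCountLines) {
            // WordCount's default output format is "word<TAB>count"
            String[] parts = line.split("\t");
            int count = Integer.parseInt(parts[1]);
            if (count > bestCount) {
                bestCount = count;
                bestWord = parts[0];
            }
        }
        return bestWord + "\t" + bestCount;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Dane\t1", "Jim\t4", "Tom\t1");
        System.out.println(findMax(lines)); // Jim	4
    }
}
```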

www
  • That completely makes sense. Do you know what that MapReduce job would look like? I am not sure how to compare the output, and I am trying to write one JAR file in Java that runs the word count and outputs just the word and its frequency. Do you know how to do that? I am stuck. – mike_tech123 Mar 06 '15 at 08:37
  • Such a job would be rather simple. The map should generate key-value pairs with the same key, e.g. (1, Jim 4), (1, Tom 1). Everything will go to the ONE reducer. A loop there should go through all the values and emit only one pair at the end. GL. It's quite a fun task! – www Mar 06 '15 at 08:57
  • Okay, I have been trying to find a way to read the output file from the first MapReduce job. Do you know what that one reducer would look like? Could you sketch some pseudo-code so I have a point to start at? Just stuck right now. – mike_tech123 Mar 06 '15 at 09:44
  • Iterate through all of the values and keep the value with the highest count. – www Mar 06 '15 at 09:59

If you force MapReduce to run with only one reduce task, you can implement a search for the highest frequency across all keys in a loop inside that reducer.

At the end of the loop you have the key with the highest frequency, and that single pair is what you send to the final output (the context.write() call should be executed only once, at the end).
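A standalone sketch of that flow (plain Java, no Hadoop; class and method names are illustrative): count every word as the map/combine phase would, then loop over the totals as the single reducer would, emitting only the maximum at the end:

```java
import java.util.HashMap;
import java.util.Map;

public class SingleReducerMax {
    // Emulates word counting followed by a single reducer that tracks
    // the running maximum and writes exactly one pair at the end.
    static Map.Entry<String, Integer> maxFrequency(String input) {
        // the "word count" phase: token -> total occurrences
        Map<String, Integer> counts = new HashMap<>();
        for (String token : input.split("\\s+")) {
            counts.merge(token, 1, Integer::sum);
        }
        // the "single reducer" loop: remember only the current maximum
        Map.Entry<String, Integer> best = null;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (best == null || e.getValue() > best.getValue()) {
                best = e;
            }
        }
        return best; // written once, as in cleanup()
    }

    public static void main(String[] args) {
        System.out.println(maxFrequency("Jim Jim Jim Jim Tom Dane")); // Jim=4
    }
}
```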

Tuxman