How to get the maximum word count in Hadoop?

Question

I have managed to get my Word Count program working, and now I want to find, for each file, the word with the maximum number of occurrences.

My output for my WordCount looks like this:

File1:Word1: x
File1:Word2: x

Here File1 is the name of a file, Word1 and Word2 are the searched words, and x is the count.

I want to get the maximum number for these word counts. So, going to my example:

File1:Word1: 4
File1:Word2: 10
File2:Word1: 4
File2:Word2: 1

I would like Word2 of File1 and Word1 of File2 to each be incremented by 1, because each of them has the maximum count among the words of its file.

Unfortunately, I am having a tough time getting the output I would like.

My map function looks like this:

public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> outputCollector, Reporter reporter)
        throws IOException { 

    String parsedLine = value.toString();
    String[] pieces = parsedLine.split(":");
    StringTokenizer tokenizer = new StringTokenizer(pieces[1]);

    while (tokenizer.hasMoreTokens()) {
        String token = tokenizer.nextToken();
        outputCollector.collect(new Text(token), ONE);
    }
}

And my Reduce looks like this:

private int maximum = 0;

@Override
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> outputCollector, Reporter reporter)
        throws IOException {

    Text occuredKey = new Text();

    int total = 0;
    while (values.hasNext()) {
        total += values.next().get();
    }

    if (total > maximum) {
        maximum = total;
        occuredKey.set(key);
    }
    outputCollector.collect(occuredKey, new IntWritable(total));
}

I have tried several things:

Putting the keywords (Word1 and Word2 in this example) in a Map, which did not work.
Iterating through the input in my map function and, whenever a word was found, adding it to a List and then comparing the List sizes.

My understanding is the first job's output is the second job's input but that doesn't seem right since I can't access the count from the first job.
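For reference, the count is still present in each line the first job writes, so a second job can recover it by parsing the line. The following is a minimal sketch in plain Java (no Hadoop classes), assuming every line has exactly the form `File1:Word1: 4` shown above and that neither the file name nor the word contains a colon. `CountLine` is a hypothetical helper name, not part of the code in this question.

```java
// Hypothetical helper, not part of the original program.
// Holds the three fields of one line of the first job's output.
final class CountLine {
    final String file;
    final String word;
    final int count;

    CountLine(String file, String word, int count) {
        this.file = file;
        this.word = word;
        this.count = count;
    }

    // Parses a line such as "File1:Word1: 4".
    // Assumption: the line splits into exactly three pieces on ':'.
    static CountLine parse(String line) {
        String[] pieces = line.split(":");
        if (pieces.length != 3) {
            throw new IllegalArgumentException("Unexpected line: " + line);
        }
        // trim() removes the space (or tab) around each piece.
        return new CountLine(pieces[0].trim(),
                             pieces[1].trim(),
                             Integer.parseInt(pieces[2].trim()));
    }

    public static void main(String[] args) {
        CountLine parsed = CountLine.parse("File1:Word2: 10");
        System.out.println(parsed.file + " " + parsed.word + " " + parsed.count);
    }
}
```

In the map function shown earlier, only `pieces[1]` (the word) is used, so the count in the third piece is discarded before the reducer ever sees it.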

Any help is appreciated; I have been stuck on this for a while.

To be clear on the output:

I have 60 files, and each file contains the same 5 words that were searched for in my Word Count. So the first job's output file has 60 x 5 = 300 records in total. The second job should take the 5 words and count, for each word, in how many files it had the highest count of the 5. So the output should be 5 records, and the counts in these 5 records should add up to 60.
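To make the intended result concrete, here is a minimal sketch of the computation in plain Java, without any Hadoop classes. It assumes each input line has the form `File1:Word1: 4`, and `MaxWordPerFile` and `countWinners` are hypothetical names used only for illustration. The first loop groups by file and keeps the word with the highest count; the second loop counts how often each word was a winner. In MapReduce terms this roughly corresponds to grouping by file first and by word second, but that mapping is an assumption, not a tested job.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class, not part of the original program.
final class MaxWordPerFile {

    // Returns, for each word, the number of files in which it had the highest count.
    static Map<String, Integer> countWinners(Iterable<String> lines) {
        Map<String, String> bestWord = new HashMap<>();   // file -> winning word
        Map<String, Integer> bestCount = new HashMap<>(); // file -> winning count

        // Step 1: per file, keep the word with the highest count.
        for (String line : lines) {
            String[] pieces = line.split(":");
            String file = pieces[0].trim();
            String word = pieces[1].trim();
            int count = Integer.parseInt(pieces[2].trim());

            Integer current = bestCount.get(file);
            // Strict '>' means that on a tie the first word seen stays the winner.
            if (current == null || count > current) {
                bestCount.put(file, count);
                bestWord.put(file, word);
            }
        }

        // Step 2: count how many files each word won.
        Map<String, Integer> wins = new TreeMap<>();
        for (String word : bestWord.values()) {
            wins.merge(word, 1, Integer::sum);
        }
        return wins;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "File1:Word1: 4",
                "File1:Word2: 10",
                "File2:Word1: 4",
                "File2:Word2: 1");
        System.out.println(countWinners(lines)); // prints {Word1=1, Word2=1}
    }
}
```

With the four sample lines, Word2 wins File1 (10 > 4) and Word1 wins File2 (4 > 1), so each word ends up with a count of 1. Note that in this sketch a word that never wins is absent from the result rather than listed with 0, and ties go to the first word encountered; both choices are assumptions that may need adjusting.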

It's difficult to understand what exactly you are looking for. Are you looking for the word with the maximum count per file, or do you want the counts for each word per file? From your code, it seems that occuredKey will be empty if the total for a word is not greater than maximum. It would be useful if you pasted your actual output here. — Vikram Patil, Feb 21 '19 at 04:11
I have 60 files and each file has the same 5 words that were searched for in my Word Count. So I have 60 x 5 total records in my output file for the first job. The second job will take the 5 words and count how many times that word was the highest of the collection of 5 for each file. So, my output for this should be 5 records and the total count for these 5 records should equal 60. — Namorange, Feb 21 '19 at 04:23
@Namorange I guess you want to move `occuredKey.set(key);` outside the `if` statement. — ViKiG, Feb 21 '19 at 06:01
Would I need to pass the maximum as the second parameter to the outputCollector? — Namorange, Feb 21 '19 at 12:33
