I have managed to get my Word Count program working, and now I want to find which word has the maximum occurrence in each file.
My output for my WordCount looks like this:
File1:Word1: x
File1:Word2: x
Where File represents a File, Word represents the searched Word and x is the count.
I want to find the maximum of these word counts within each file. So, going to my example:
File1:Word1: 4
File1:Word2: 10
File2:Word1: 4
File2:Word2: 1
I would like the tally for Word2 (the highest count in File1) and the tally for Word1 (the highest count in File2) to each be incremented by 1, because those are the maximum word counts for their respective files.
Unfortunately, I am having a tough time getting the output I would like.
My map function looks like this:
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> outputCollector, Reporter reporter)
            throws IOException {
        String parsedLine = value.toString();
        String[] pieces = parsedLine.split(":");
        StringTokenizer tokenizer = new StringTokenizer(pieces[1]);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            outputCollector.collect(new Text(token), ONE);
        }
    }
And my Reduce looks like this:
    private int maximum = 0;

    @Override
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> outputCollector, Reporter reporter)
            throws IOException {
        Text occuredKey = new Text();
        int total = 0;
        while (values.hasNext()) {
            total += values.next().get();
        }
        if (total > maximum) {
            maximum = total;
            occuredKey.set(key);
        }
        outputCollector.collect(occuredKey, new IntWritable(total));
    }
I have tried several things:
Putting the keywords (Word1 and Word2 in this example) in a Map, which didn't work.
Iterating in my Map and, whenever the word was found, putting it in a List and then comparing the List sizes.
My understanding is that the first job's output becomes the second job's input, but something doesn't seem right, since I can't access the count from the first job.
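For what it's worth, here is a small standalone check (plain Java, no Hadoop; the class name SplitCheck and the parse helper are just for illustration) of what split(":") yields for one line, assuming the line looks exactly like my example above. It suggests the count ends up in the third piece, which my mapper never reads.

```java
public class SplitCheck {

    // Splits one line of the first job's output into {file, word, count}.
    // Assumes the line has the shape "File1:Word2: 10" (an assumption based on the example output).
    static String[] parse(String line) {
        String[] pieces = line.split(":");
        return new String[] { pieces[0].trim(), pieces[1].trim(), pieces[2].trim() };
    }

    public static void main(String[] args) {
        String[] p = parse("File1:Word2: 10");
        // pieces[0] is the file, pieces[1] the word, pieces[2] the count (after trimming the space)
        System.out.println(p[0] + " | " + p[1] + " | " + Integer.parseInt(p[2]));
    }
}
```

Run on its own, this prints "File1 | Word2 | 10".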
Any help is appreciated; I have been stuck on this for a while.
To be clear on the output:
I have 60 files, and each file contains the same 5 words that were searched for in my Word Count. So the first job's output file has 60 x 5 = 300 records in total. The second job should take those records and, for each of the 5 words, count how many times that word had the highest count of the 5 within a file. So the output should be 5 records, and the counts in those 5 records should add up to 60.
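To make the expected result concrete, here is a plain-Java sketch (no Hadoop) of the computation I am after, run on the four example lines from above. The class and method names are made up for illustration, and breaking ties in favor of the word seen first is only an assumption on my part.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MaxWordPerFile {

    // Returns, for each word, the number of files in which it had the highest count.
    // Ties within a file go to the word seen first (an assumption, not a requirement).
    static Map<String, Integer> countWinners(List<String> lines) {
        Map<String, String> bestWord = new HashMap<>();    // file -> word with the highest count so far
        Map<String, Integer> bestCount = new HashMap<>();  // file -> that highest count

        for (String line : lines) {
            // Expected shape: "File1:Word1: 4"
            String[] pieces = line.split(":");
            String file = pieces[0].trim();
            String word = pieces[1].trim();
            int count = Integer.parseInt(pieces[2].trim());

            Integer current = bestCount.get(file);
            if (current == null || count > current) {
                bestCount.put(file, count);
                bestWord.put(file, word);
            }
        }

        // Tally how many files each word "won"; TreeMap keeps the words sorted.
        Map<String, Integer> winners = new TreeMap<>();
        for (String word : bestWord.values()) {
            winners.merge(word, 1, Integer::sum);
        }
        return winners;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "File1:Word1: 4",
                "File1:Word2: 10",
                "File2:Word1: 4",
                "File2:Word2: 1");
        System.out.println(countWinners(lines)); // prints {Word1=1, Word2=1}
    }
}
```

Note that in this sketch a word that never has the highest count in any file is simply absent from the result rather than listed with 0. What I can't work out is how to express the same thing as a second MapReduce job.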