MapReduce - Emitting the top 20% occurring words in the document

Question

I was reading about MapReduce here, and the first example they give is counting the number of occurrences of each word in a document. Suppose you wanted to get the top 20% most frequently occurring words in the document instead. How can you achieve that? It seems unnatural, since each node in the cluster cannot see the whole file, just the list of all occurrences for a single word. Is there a way to achieve that?

score 0 · Accepted Answer · edited May 23 '17 at 12:27

Yes, you certainly can: force Hadoop to use just a single reducer (although with this approach you lose the advantage of distributed computing in the reduce phase).

This can be done as follows:

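// Configure the job to use just one reducer.
// Note: mapred.tasktracker.reduce.tasks.maximum is a per-TaskTracker slot limit,
// so setting it on the job configuration probably does not change the reducer count.
conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 1);
conf.setInt("mapred.reduce.tasks", 1); // same effect as job.setNumReduceTasks(1)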

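Since there is now just one reducer, it sees the count for every word, so it can keep track of the counts and emit the top 20% in run() or cleanup() of the reducer. The 20% cutoff depends on the total number of distinct words, so nothing can be emitted until all of the input has been consumed; that is why the output is written in cleanup() rather than in reduce(). See here for more.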
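As a rough illustration, here is a minimal sketch of the selection step the single reducer could run in cleanup(), once reduce() has buffered every (word, count) pair in a map. It is plain Java with the Hadoop wiring left out, and the names TopWords and topPercent are hypothetical helpers made up for this example, not part of any Hadoop API. It also assumes that "top 20%" means the top 20% of distinct words ranked by count, rounded up, with ties broken alphabetically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Hadoop class): the selection a single reducer
// could perform in cleanup() after buffering all (word, count) pairs.
class TopWords {

    // Returns the words ranked in the top `percent` percent of all distinct
    // words by count, most frequent first. Ties are broken alphabetically.
    static List<String> topPercent(Map<String, Integer> counts, int percent) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        // Integer ceiling of size * percent / 100, capped at the list size,
        // so a non-empty input with percent > 0 yields at least one word.
        int keep = Math.min(entries.size(), (entries.size() * percent + 99) / 100);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < keep; i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

In a real reducer, reduce() would sum the values for each key and store the total in the map, and cleanup() would call something like topPercent(counts, 20) and write each returned word to the context. Note that this keeps every distinct word in memory on one node, which is the price of the single-reducer approach.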
