
I am very new to the Hadoop world and struggling to achieve one simple task.

Can anybody please tell me how to get the top N values for the word count example, using only MapReduce code?

I do not want to use any Hadoop command for this simple task.

  • What do you mean "I do not want to use any Hadoop command for this simple task"? – Donald Miner Dec 14 '13 at 12:27
  • In Hadoop, the reducer sorts the output on the basis of the keys. So while writing the output, if we just swap the key and value, i.e., write the value (which will be the count) as the key and the key as the value, then it'll sort on the basis of the counts. Then all we have to do is run the command: hadoop fs -cat | tail -n, where n is the number of top values you want to know. But I do not want to use the above command to accomplish the task; I just want to do it by MapReduce programming only. – user3078014 Dec 14 '13 at 12:31
  • Wrong. The reducer does not sort the output. The reducer sorts its input from the mappers! Big difference! – Donald Miner Dec 14 '13 at 12:42
  • Also, you are selling yourself short by not wanting to do `tail -n`. Why solve a problem that doesn't require parallel programming with parallel programming? You really want to run an M/R job over a few thousand records?? – Donald Miner Dec 14 '13 at 12:42

1 Answer


You have two obvious options:


Option 1: Have two MapReduce jobs:

  1. WordCount: counts all the words (pretty much the example exactly)
  2. TopN: A MapReduce job that finds the top N of something (here are some examples: source code, blog post)

Have WordCount write its output to HDFS. Then, have TopN read that output. This is called job chaining, and there are a number of ways to solve this problem: Oozie, bash scripts, firing two jobs from your driver, etc.

The reason you need two jobs is that you are doing two aggregations: the first is the word count, and the second is the top N. Typically in MapReduce, each aggregation requires its own MapReduce job.
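
For illustration, here is a minimal sketch of the "firing two jobs from your driver" variant, using the standard Hadoop Job API. The class names (WordCountMapper, TopNMapper, and so on) and the intermediate path /tmp/wordcount are placeholders for your own code, not code from the linked examples:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountTopNDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Job 1: the standard word count, written to an intermediate HDFS path.
            Job wordCount = Job.getInstance(conf, "wordcount");
            wordCount.setJarByClass(WordCountTopNDriver.class);
            wordCount.setMapperClass(WordCountMapper.class);   // your word count mapper
            wordCount.setReducerClass(WordCountReducer.class); // your word count reducer
            wordCount.setOutputKeyClass(Text.class);
            wordCount.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(wordCount, new Path(args[0]));
            FileOutputFormat.setOutputPath(wordCount, new Path("/tmp/wordcount"));
            if (!wordCount.waitForCompletion(true)) System.exit(1); // don't chain on failure

            // Job 2: top N, reading job 1's output. It starts only after job 1 succeeds.
            Job topN = Job.getInstance(conf, "topN");
            topN.setJarByClass(WordCountTopNDriver.class);
            topN.setMapperClass(TopNMapper.class);   // your top-N mapper
            topN.setReducerClass(TopNReducer.class); // your top-N reducer
            topN.setNumReduceTasks(1);               // one reducer sees all candidates
            topN.setOutputKeyClass(Text.class);
            topN.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(topN, new Path("/tmp/wordcount"));
            FileOutputFormat.setOutputPath(topN, new Path(args[1]));
            System.exit(topN.waitForCompletion(true) ? 0 : 1);
        }
    }

Setting a single reducer on the TopN job is the simplest way to guarantee a global top N; the linked examples show how to do it with less of a bottleneck.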


Option 2: First, have your WordCount job run on the data. Then, use some bash to pull the top N out.

hadoop fs -cat /output/of/wordcount/part* | sort -n -k2 -r | head -n20

sort -n -k2 -r says "sort numerically by column #2, in descending order". head -n20 pulls the top twenty.

This is the better option for WordCount, just because WordCount will probably only output on the order of thousands or tens of thousands of lines, and you don't need a MapReduce job for that. Remember that just because you have Hadoop around doesn't mean you should solve all your problems with Hadoop.


Option 3: One non-obvious version, which is tricky but a mix of both of the above...

Write a WordCount MapReduce job, but in the reducer do something like the TopN MapReduce jobs I showed you earlier. Then, have each reducer output only the top N results from that reducer.

So, if you are doing a top 10, each reducer will output at most 10 results. Say you have 30 reducers: you'll output at most 300 results.
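
To make that concrete, here is a rough sketch of such a WordCount-plus-TopN reducer (the class name and N = 10 are assumptions, not code from the linked examples). It sums counts as usual, but keeps a running top 10 in a TreeMap instead of writing every word, and emits only those 10 in cleanup(), which runs after the reducer has seen all of its keys:

    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TopNWordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private static final int N = 10;
        // Running top N for this reducer, ordered by count (smallest first).
        private final TreeMap<Integer, String> topWords = new TreeMap<Integer, String>();

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context) {
            // The normal word count aggregation...
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            // ...but instead of writing (word, sum), remember only the top N.
            // Caveat: keying the TreeMap on the count alone overwrites ties;
            // a real implementation would keep a list of words per count.
            topWords.put(sum, word.toString());
            if (topWords.size() > N) {
                topWords.remove(topWords.firstKey()); // evict the smallest count
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Runs once at the end; emit this reducer's top N, largest count first.
            for (Map.Entry<Integer, String> e : topWords.descendingMap().entrySet()) {
                context.write(new Text(e.getValue()), new IntWritable(e.getKey()));
            }
        }
    }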

Then, do the same thing as in option #2 with bash:

hadoop fs -cat /output/of/wordcount/part* | sort -n -k2 -r | head -n10

This should be faster because you are only postprocessing a fraction of the results.

This is the fastest way I can think of doing this, but it's probably not worth the effort.

  • Hi Donald, many thanks for your solution. Can you please tell me whether this approach is also correct, which I am writing here: In Hadoop, the reducer sorts the output on the basis of the keys. So while writing the output, if we just swap the key and value, i.e., write the value (which will be the count) as the key and the key as the value, then it'll sort on the basis of the counts. Then all we have to do is run the command: hadoop fs -cat | tail -n, where n is the number of top values we want to know. – user3078014 Dec 14 '13 at 12:50
  • So we need to run two MapReduce jobs to accomplish this task: the first job to find the words and their corresponding counts, and the second job to find the top N of something, i.e., the top N from every reducer. I am not very clear about the code of the second job. How does it work, and how do we finally get the top N values out of all the reducers' output? How does it calculate the top N of something each time and then finally generate the exact top N values? – user3078014 Dec 14 '13 at 13:11
  • Can you please explain the command `sort -n -k2 -r | head -n20`, i.e., what are -n, -k2, and -r in the command? – user3078014 Dec 14 '13 at 14:57
  • Your approach is not correct. Hadoop does not sort output from the reducers. It sorts the input to the reducer. So swapping it will do nothing. For the code in top N, look at the links I provide. `-n` means sort numerically. `-k2` means sort on column 2. `-r` means sort in descending order. Run `man sort` to learn more about sort. – Donald Miner Dec 14 '13 at 16:18
  • Great answer and great blog post for topN in mapreduce! – vefthym May 21 '14 at 09:35
  • Your last suggestion will not always be correct. A situation that breaks your algorithm: 10 reduce jobs, with 100 occurrences for each of their top 10 items, all disjoint between map jobs, and an 11th item that has 50 occurrences; this is actually the top item, with 500 occurrences between all jobs, but the reducers will never catch it, and it will be omitted, and the final reduce stage will never see it. – David Manheim Jun 13 '14 at 20:04
  • I don't think you are understanding, David. The reducers group by the item name, so your hypothetical 11th item would be seen as the greatest in the reducer that got it. It can't be spread out. – Donald Miner Jun 14 '14 at 20:40