
Before flagging this question as a duplicate, please read it through.

This has been asked many times with no clear answer. Let's say my task is to compute the unigram probability of every word in millions of files. I can emit word counts from the mappers, and the reducers can aggregate the counts for each word. However, to compute probabilities, each reducer needs the total number of words. One way to do this would be to have every mapper send its word total to every reducer under a special key, and to sort the keys so that these totals arrive before the per-word counts. A reducer can then simply add up the totals received from the mappers to obtain the grand total number of words.
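To make the reducer-side idea concrete, here is a minimal plain-Java sketch of what a single reducer would do with its input. It uses no Hadoop classes: the `UnigramReducerSketch` class, the `!TOTAL` key and the `probabilities` helper are all made-up names for illustration, and the explicit sort only stands in for the ordering the shuffle phase would provide.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of one reducer's input; not actual Hadoop code.
class UnigramReducerSketch {
    // Hypothetical special key. '!' sorts before letters and digits,
    // so these records are seen before any ordinary word.
    static final String TOTAL_KEY = "!TOTAL";

    // records: (key, count) pairs as one reducer might receive them, in any order.
    static Map<String, Double> probabilities(List<Map.Entry<String, Long>> records) {
        // Stand-in for the shuffle: order records by key.
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(records);
        sorted.sort(Map.Entry.comparingByKey());

        long total = 0;          // grand total, complete before the first word arrives
        String current = null;   // word currently being aggregated
        long sum = 0;            // running count for the current word
        Map<String, Double> probs = new LinkedHashMap<>();

        for (Map.Entry<String, Long> r : sorted) {
            if (r.getKey().equals(TOTAL_KEY)) {
                total += r.getValue();   // one per-mapper total
                continue;
            }
            if (!r.getKey().equals(current)) {
                if (current != null) {
                    probs.put(current, (double) sum / total);
                }
                current = r.getKey();
                sum = 0;
            }
            sum += r.getValue();
        }
        if (current != null) {
            probs.put(current, (double) sum / total);
        }
        return probs;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> records = List.of(
            Map.entry("hadoop", 3L),
            Map.entry(TOTAL_KEY, 6L),   // total from a first mapper
            Map.entry("hadoop", 1L),
            Map.entry(TOTAL_KEY, 4L),   // total from a second mapper
            Map.entry("map", 2L));
        System.out.println(probabilities(records));   // prints {hadoop=0.4, map=0.2}
    }
}
```

Because the totals sort first, the grand total is already final when the first real word shows up, so each probability can be written out as soon as that word's counts have been summed.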

But how do I make mappers send counts to all reducers?

I can find out the total number of reducers from the job properties; say it is n. Isn't there a way to call Context.write() n times from each mapper, specifying the partition numbers 0 to n-1 in turn, so that the data reaches all the reducers?

abhinavkulkarni
  • @cabad: There are arguments against using counters. Counters are meant for aggregating job-level statistics. Additionally, in the new API, counters are write-only. See a reply from Robert Evans, a `Hadoop` committer, here: http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201112.mbox/%3CCADYHM8xiw8_bF=ZQe-bAgdFZ6r3tOb0AOF9ViOZgtZEQGkPZVA@mail.gmail.com%3E – abhinavkulkarni Oct 10 '13 at 23:01

1 Answer


You could use a custom Partitioner for this purpose.

Given the number of reducers n, you can emit your word count n times in the mapper, with the keys 1, 2, ..., n. A custom Partitioner class will ensure that reducer i receives only the values with key i.
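A minimal sketch of the routing logic, written as plain Java so it runs without Hadoop on the classpath. In a real job the `getPartition` method would live in a subclass of `org.apache.hadoop.mapreduce.Partitioner` (whose `getPartition(key, value, numPartitions)` has the same shape) and be registered with `job.setPartitionerClass(...)`. The `BroadcastSketch` class, the `!TOTAL#` prefix and the `broadcastKeys` helper are hypothetical names, and the sketch numbers the keys 0 to n-1 to match partition numbers rather than 1 to n.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for a custom Partitioner; not actual Hadoop code.
class BroadcastSketch {
    // Hypothetical prefix marking the "total" records; the target reducer follows it.
    static final String PREFIX = "!TOTAL#";

    // Mapper side: the keys under which one mapper would write its word total,
    // one copy per reducer.
    static List<String> broadcastKeys(int numReducers) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            keys.add(PREFIX + i);
        }
        return keys;
    }

    // Partitioner side: send a broadcast key to the reducer named in the key,
    // and every other key to a hash-based partition.
    static int getPartition(String key, int numPartitions) {
        if (key.startsWith(PREFIX)) {
            int target = Integer.parseInt(key.substring(PREFIX.length()));
            return target % numPartitions;
        }
        // Same formula as Hadoop's default HashPartitioner.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int n = 4;
        for (String key : broadcastKeys(n)) {
            System.out.println(key + " -> reducer " + getPartition(key, n));
        }
        System.out.println("hadoop -> reducer " + getPartition("hadoop", n));
    }
}
```

Each mapper would write its total once per key returned by `broadcastKeys`, so every reducer receives exactly one copy per mapper. Since `!` sorts before letters and digits, these records would also reach the reducer ahead of ordinary words under the default key ordering for text keys.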

harpun
  • Wouldn't all the reducers generate the same aggregate count? – cabad Oct 10 '13 at 21:53
  • @cabad the OP asked for a way to emit values to all reducers from one mapper. Ensuring that this suits the OP's needs is his task. In my opinion, the question does not clearly state what problem is actually being solved. – harpun Oct 10 '13 at 22:03
  • I understand your point; however, other people may read this answer in the future and may not notice that having 100 reducers all produce the same answer ("The sum is X!") is not a good idea. At least they can read the comments and see whether this is really what they want. – cabad Oct 10 '13 at 22:09
  • @harpun: I have phrased the question better. Hopefully the scenario is clear now. I think your strategy would work. – abhinavkulkarni Oct 10 '13 at 22:43