3

I have a MapReduce job that has to write to multiple outputs. I am using multipleOutputFormat, as in this example: http://grepalex.com/2013/05/20/multipleoutputs-part1/

Here is the challenge:

  • If my partitioner sends exactly one key to each reducer (assume each key corresponds to a separate output file), then the reducers that receive a lot of data take forever.
  • If my partitioner distributes records to reducers randomly (theKey+randomNumber), then many reducers write to multiple outputs and I run into I/O problems.

As a solution:

  • Option 1: Allocate keys to reducers according to their weight, so that all reducers have the same load. (One large key is split across 5 different reducers, whereas 6 small keys are sent to a single reducer.)

  • Option 2: Again allocate keys according to their weight, but make sure each reducer handles only one key. (One large key is split across 5 different reducers, but the 6 small keys are each sent to a separate reducer as well.)
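
To make the two options concrete, here is a minimal sketch of the allocation step in plain Java. It is only an illustration under assumptions: the class `WeightedKeyAllocator`, its methods, and the `capacity` parameter are hypothetical names, and the per-key weights are assumed to be known before the job runs (for example from an earlier counting pass). In a real job the resulting plan would still have to be made available to a custom `Partitioner`, whose `getPartition` would return the reducer index chosen by `pick`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: decides which reducers may receive each output key. */
public class WeightedKeyAllocator {

    /**
     * @param weights        estimated record count per key (assumed to be known up front)
     * @param capacity       target number of records per reducer (an assumed tuning knob)
     * @param shareSmallKeys true = option 1 (small keys are packed onto a shared reducer),
     *                       false = option 2 (a reducer never handles more than one key)
     * @return for each key, the list of reducer indices its records are spread over
     */
    public static Map<String, List<Integer>> allocate(
            Map<String, Long> weights, long capacity, boolean shareSmallKeys) {
        Map<String, List<Integer>> plan = new LinkedHashMap<>();
        int next = 0;        // next unused reducer index
        int shared = -1;     // reducer currently collecting small keys (option 1 only)
        long sharedLoad = 0; // records already assigned to the shared reducer

        for (Map.Entry<String, Long> e : weights.entrySet()) {
            long w = e.getValue();
            List<Integer> reducers = new ArrayList<>();
            if (shareSmallKeys && w < capacity) {
                // Option 1: pack small keys together until the shared reducer is full.
                if (shared < 0 || sharedLoad + w > capacity) {
                    shared = next++;
                    sharedLoad = 0;
                }
                sharedLoad += w;
                reducers.add(shared);
            } else {
                // Give the key ceil(w / capacity) dedicated reducers, at least one.
                long n = Math.max(1, (w + capacity - 1) / capacity);
                for (long i = 0; i < n; i++) {
                    reducers.add(next++);
                }
            }
            plan.put(e.getKey(), reducers);
        }
        return plan;
    }

    /** Picks one of a key's reducers from a per-record salt (e.g. a random number). */
    public static int pick(List<Integer> reducers, long salt) {
        return reducers.get((int) Math.floorMod(salt, (long) reducers.size()));
    }
}
```

With one key of weight 500, six keys of weight 10 and a capacity of 100, this sketch uses 6 reducers for option 1 (five for the large key, one shared by the small keys) and 11 reducers for option 2 (five for the large key, one per small key). The job would need to be configured with at least that many reduce tasks.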

Pros & Cons:

  • Option 1: The reducers are balanced, but some reducers (those handling the small keys) write to several files.
  • Option 2: The reducers are unbalanced, but the maximum load on any reducer is bounded and each reducer writes to a single file.

Both of these options run in a reasonable amount of time. I need some advice about which one I should go with.

Thanks

Zword
sahara
  • What is the point in using a reducer if you don't need to group your results by key? Let the mapper write to the destination directly. – Thomas Jungblut Feb 04 '14 at 19:22

2 Answers

0

Option 1 seems to be the better choice. The execution time of both options will be nearly the same, but Option 1 minimizes the overhead work that has to be done to run each reducer.

LeonardBlunderbuss
0

Option 2 is better. There is also another option: add one more column to the key, using a column that already exists in the input data, so that random keys are no longer needed.