Hadoop: how to allocate keys to reducers to handle unbalanced load (custom partitioner)

Question

I have a MapReduce job that has to write to multiple outputs. I am using MultipleOutputFormat, as in this example: http://grepalex.com/2013/05/20/multipleoutputs-part1/

Here is the challenge:

If my partitioner sends each key to its own reducer (assume each key corresponds to a separate output file), the reducers that receive a lot of data take forever.
If my partitioner assigns records to reducers randomly (theKey+randomNumber), many reducers write to multiple outputs, and I run into I/O problems.
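
For illustration, the two partitioning strategies above might look like the following plain-Java sketch. It does not use the Hadoop API; in a real job the same arithmetic would presumably go inside a custom partitioner's getPartition(key, value, numPartitions) method. The class name, the method names, and the saltRange parameter are hypothetical.

```java
import java.util.concurrent.ThreadLocalRandom;

// Plain-Java sketch of the two strategies; not tied to the Hadoop classes.
class PartitionStrategiesSketch {

    // Strategy 1: every record of a key goes to the same reducer.
    // A heavy key therefore lands entirely on one reducer.
    static int byKey(String key, int numReducers) {
        // Mask the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Strategy 2: append a random salt (theKey+randomNumber) so the records
    // of one key are spread over several reducers. Each reducer may then
    // receive records for many keys, i.e. write to many output files.
    static int byKeyPlusRandom(String key, int numReducers, int saltRange) {
        int salt = ThreadLocalRandom.current().nextInt(saltRange);
        return ((key + salt).hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        System.out.println("byKey:           " + byKey("fileA", 8));
        System.out.println("byKeyPlusRandom: " + byKeyPlusRandom("fileA", 8, 5));
    }
}
```

With byKey the reducer index for a given key is stable across calls; with byKeyPlusRandom it can change from record to record.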

As a solution:

Option 1: Allocate keys to reducers according to their weight, so that all reducers have the same load. (One large key is sent to 5 different reducers, whereas 6 small keys are sent to a single reducer.)
Option 2: Again allocate keys according to their weight, but make sure a reducer handles only one key. (One large key is sent to 5 different reducers, and the 6 small keys are sent to separate reducers as well.)
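
The two allocation schemes can be sketched as a planning step that maps each key to a list of reducer ids. This is only a minimal sketch under assumptions: the per-key weights and the per-reducer capacity are taken as known in advance, the class and method names are hypothetical, and small keys are packed with a simple next-fit rule rather than anything optimal.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ReducerAllocationSketch {

    // Reducers a key needs so that no reducer gets more than `capacity`.
    static int slices(long weight, long capacity) {
        return (int) Math.max(1, (weight + capacity - 1) / capacity);
    }

    // Option 1: split keys at or above capacity; pack smaller keys together.
    static Map<String, List<Integer>> balanced(Map<String, Long> weights, long capacity) {
        Map<String, List<Integer>> plan = new LinkedHashMap<>();
        int next = 0;         // next unused reducer id
        int shared = -1;      // reducer currently being filled with small keys
        long sharedLoad = 0;  // load already placed on that shared reducer
        for (Map.Entry<String, Long> e : weights.entrySet()) {
            long w = e.getValue();
            List<Integer> ids = new ArrayList<>();
            if (w >= capacity) {
                int n = slices(w, capacity);
                for (int i = 0; i < n; i++) ids.add(next++);
            } else {
                // Next-fit: open a new shared reducer when this key does not fit.
                if (shared < 0 || sharedLoad + w > capacity) {
                    shared = next++;
                    sharedLoad = 0;
                }
                sharedLoad += w;
                ids.add(shared);
            }
            plan.put(e.getKey(), ids);
        }
        return plan;
    }

    // Option 2: every key gets its own reducer(s); nothing is shared.
    static Map<String, List<Integer>> dedicated(Map<String, Long> weights, long capacity) {
        Map<String, List<Integer>> plan = new LinkedHashMap<>();
        int next = 0;
        for (Map.Entry<String, Long> e : weights.entrySet()) {
            List<Integer> ids = new ArrayList<>();
            int n = slices(e.getValue(), capacity);
            for (int i = 0; i < n; i++) ids.add(next++);
            plan.put(e.getKey(), ids);
        }
        return plan;
    }

    // Number of distinct reducers a plan uses.
    static int reducerCount(Map<String, List<Integer>> plan) {
        Set<Integer> seen = new HashSet<>();
        for (List<Integer> ids : plan.values()) seen.addAll(ids);
        return seen.size();
    }

    public static void main(String[] args) {
        // Made-up weights: one large key and six small ones.
        Map<String, Long> weights = new LinkedHashMap<>();
        weights.put("big", 500L);
        for (int i = 1; i <= 6; i++) weights.put("small" + i, 15L);
        System.out.println("option 1: " + balanced(weights, 100));
        System.out.println("option 2: " + dedicated(weights, 100));
    }
}
```

With these made-up weights and a capacity of 100, option 1 uses 6 reducers (5 for the large key, 1 shared by the small keys) and option 2 uses 11.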

Pros & Cons:

In option 1: I have balanced reducers, but some reducers (those handling small keys) write to several different files.
In option 2: I have unbalanced reducers, but the maximum load on any reducer is limited and each reducer writes to its own file.
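
To make this trade-off concrete, a small sketch can compute the load and the number of output files per reducer for a given plan. Everything here is an assumption for illustration: the plans and weights are made up (one key of weight 500 split over 5 reducers, six keys of weight 15), a key's weight is assumed to split evenly across its reducers, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PlanMetricsSketch {

    // Load per reducer, assuming each key's weight is split evenly across
    // its reducers (integer division; any remainder is ignored here).
    static Map<Integer, Long> loadPerReducer(Map<String, List<Integer>> plan,
                                             Map<String, Long> weights) {
        Map<Integer, Long> load = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : plan.entrySet()) {
            long share = weights.get(e.getKey()) / e.getValue().size();
            for (int id : e.getValue()) load.merge(id, share, Long::sum);
        }
        return load;
    }

    // Number of distinct keys, and hence output files, each reducer writes.
    static Map<Integer, Integer> filesPerReducer(Map<String, List<Integer>> plan) {
        Map<Integer, Integer> files = new TreeMap<>();
        for (List<Integer> ids : plan.values()) {
            for (int id : ids) files.merge(id, 1, Integer::sum);
        }
        return files;
    }

    public static void main(String[] args) {
        Map<String, Long> weights = new LinkedHashMap<>();
        Map<String, List<Integer>> option1 = new LinkedHashMap<>();
        Map<String, List<Integer>> option2 = new LinkedHashMap<>();
        weights.put("big", 500L);
        option1.put("big", Arrays.asList(0, 1, 2, 3, 4));
        option2.put("big", Arrays.asList(0, 1, 2, 3, 4));
        for (int i = 1; i <= 6; i++) {
            weights.put("small" + i, 15L);
            option1.put("small" + i, Arrays.asList(5));      // all share reducer 5
            option2.put("small" + i, Arrays.asList(4 + i));  // reducers 5..10
        }
        System.out.println("option 1 load:  " + loadPerReducer(option1, weights));
        System.out.println("option 1 files: " + filesPerReducer(option1));
        System.out.println("option 2 load:  " + loadPerReducer(option2, weights));
        System.out.println("option 2 files: " + filesPerReducer(option2));
    }
}
```

Under these assumptions, option 1 gives loads of 100 on five reducers and 90 on a sixth that writes six files, while option 2 gives loads of 100 or 15 with exactly one file per reducer.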

Both of these options run in a reasonable amount of time. I need some advice about which one I should go with.

Thanks

What is the point in using a reducer if you don't need to group your results by key? Let the mapper write to the destination directly. — Thomas Jungblut, Feb 04 '14 at 19:22

score 0 · Answer 1 · answered Feb 04 '14 at 19:16


Option 1 seems to be the better option. The execution time of both options will be about the same, but Option 1 minimizes the overhead work needed to run each reducer.


LeonardBlunderbuss


score 0 · Answer 2 · answered Mar 17 '16 at 08:17


Option 2 is better. There is another option: add one more column to the key, using a column that already exists in the input data, so random keys are no longer needed.


杨贺林


@mjp66 How is this not an answer? The OP asks which option to use, and this answer suggests using option 2. – NathanOliver Mar 17 '16 at 21:53
