I have a MapReduce job that has to write to multiple outputs. I am using MultipleOutputFormat as in this example: http://grepalex.com/2013/05/20/multipleoutputs-part1/
Here is the challenge:
- If my partitioner sends each key to exactly one reducer (assume each key corresponds to a separate output file), then the reducers that receive keys with a lot of data take forever.
- If my partitioner spreads records across reducers randomly (theKey+randomNumber), then many reducers write to multiple outputs and I have an IO problem.
I see two possible solutions:
Option 1: Allocate keys to reducers according to their weight, so that all reducers have about the same load. (1 large key is split across 5 different reducers, whereas 6 small keys are sent to a single reducer.)
Option 2: Again allocate keys according to their weight, but make sure a reducer only ever handles one key. (1 large key is split across 5 different reducers, and the 6 small keys are each sent to a separate reducer as well.)
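To make the two allocations concrete, here is a minimal sketch in plain Java (no Hadoop dependency). It assumes the per-key weights are known before the job starts, for example from an earlier counting pass, and that `maxLoad` is the most data I want a single reducer to handle. The class and method names (`KeyAllocation`, `dedicated`, `packed`, `reducerFor`) are hypothetical and only for illustration. The first-fit packing is just one simple way to group the small keys, not necessarily the best one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class KeyAllocation {

    /** Number of reducers a key needs so that none of them exceeds maxLoad. */
    static int slotsFor(long weight, long maxLoad) {
        return (int) Math.max(1, (weight + maxLoad - 1) / maxLoad);
    }

    /** Option 2: every key gets its own range of reducers: key -> {firstReducer, count}. */
    static Map<String, int[]> dedicated(Map<String, Long> weights, long maxLoad) {
        Map<String, int[]> ranges = new LinkedHashMap<>();
        int next = 0;
        for (Map.Entry<String, Long> e : weights.entrySet()) {
            int n = slotsFor(e.getValue(), maxLoad);
            ranges.put(e.getKey(), new int[] {next, n});
            next += n;
        }
        return ranges;
    }

    /** Option 1: large keys are split as above, small keys are packed together (first fit). */
    static Map<String, int[]> packed(Map<String, Long> weights, long maxLoad) {
        Map<String, int[]> ranges = new LinkedHashMap<>();
        List<Long> sharedLoad = new ArrayList<>();  // current load of each shared reducer
        List<Integer> sharedId = new ArrayList<>(); // reducer index of each shared reducer
        int next = 0;
        for (Map.Entry<String, Long> e : weights.entrySet()) {
            long w = e.getValue();
            if (w >= maxLoad) {
                // Large key: give it a dedicated range of reducers.
                int n = slotsFor(w, maxLoad);
                ranges.put(e.getKey(), new int[] {next, n});
                next += n;
                continue;
            }
            // Small key: reuse the first shared reducer that still has room.
            int slot = -1;
            for (int i = 0; i < sharedLoad.size(); i++) {
                if (sharedLoad.get(i) + w <= maxLoad) {
                    slot = i;
                    break;
                }
            }
            if (slot < 0) {
                // No shared reducer has room, so open a new one.
                sharedLoad.add(0L);
                sharedId.add(next++);
                slot = sharedLoad.size() - 1;
            }
            sharedLoad.set(slot, sharedLoad.get(slot) + w);
            ranges.put(e.getKey(), new int[] {sharedId.get(slot), 1});
        }
        return ranges;
    }

    /** Picks the reducer for one record by spreading a split key over its range. */
    static int reducerFor(int[] range, int recordHash) {
        return range[0] + (recordHash & Integer.MAX_VALUE) % range[1];
    }
}
```

In the real job, a custom `Partitioner` would look up the range for the record's key inside `getPartition` and return something like `reducerFor(range, value.hashCode())`. The number of reduce tasks would then have to match the total number of reducers the chosen allocation uses.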
Pros & Cons:
- In option 1: I have balanced reducers, but some reducers (the ones handling small keys) write to several files.
- In option 2: I have unbalanced reducers, but the maximum load on any reducer is bounded and each reducer writes to its own file.
Both of these options run in a reasonable amount of time. I need some advice about which one I should go with.
Thanks