Partitioner or MultipleOutputs

Question

I would like to have your opinion regarding Partitioner vs MultipleOutputs.
Suppose I have a file which contains keys such as:

0:aaa  
1:bbb  
0:ccc  
0:ddd  
...  
1:zzz

I would like to have 2 files: one containing the keys starting with 0: and the other containing the keys starting with 1:. Which approach should I use?
1) Use a custom Partitioner which parses the keys and returns 0 or 1 from getPartition().
2) Use MultipleOutputs.write in the reduce phase, parsing the key and passing zero or one as the namedOutput parameter of MultipleOutputs.write.

Which one is better? For me, 1) is better because reducers deal with a single file.
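
Option 1 can be sketched in plain Java as below. This is only a model of the logic that would sit inside getPartition(), not a drop-in Hadoop class: PrefixPartitionSketch and partitionFor are hypothetical names, and the key is assumed to be a plain String. In a real job the class would extend org.apache.hadoop.mapreduce.Partitioner, the key would typically be a Text, and the job would need two reduce tasks (job.setNumReduceTasks(2)).

```java
// Plain-Java model of the logic a custom Partitioner's getPartition() would contain.
// PrefixPartitionSketch and partitionFor are hypothetical names, not Hadoop classes.
final class PrefixPartitionSketch {

    // Returns 0 for keys starting with "0:" and 1 for keys starting with "1:".
    static int partitionFor(String key) {
        int colon = key.indexOf(':');
        if (colon <= 0) {
            throw new IllegalArgumentException("Key has no prefix: " + key);
        }
        String prefix = key.substring(0, colon);
        switch (prefix) {
            case "0":
                return 0;
            case "1":
                return 1;
            default:
                // Any other prefix is outside the two cases described in the question.
                throw new IllegalArgumentException("Unexpected prefix: " + prefix);
        }
    }
}
```

With this mapping, every 0: key reaches the first reducer and every 1: key reaches the second, so each reducer writes exactly one part file.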

score 0 · Accepted Answer · answered Dec 01 '13 at 21:10

If your job is only to split the input files into 2 parts, then MultipleOutputs is a better bet, as you can skip the shuffle / sort phase entirely by running a map-only job.
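
As a rough illustration of why no shuffle or sort is needed, here is a plain-Java model of the map-only split. It uses no Hadoop classes: MapOnlySplitSketch, outputNameFor, split and the "prefix" naming scheme are all assumptions made for this sketch. In an actual job, the mapper would pass a name derived like this to MultipleOutputs.write, and the job would be configured with zero reduce tasks (job.setNumReduceTasks(0)).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a map-only split; no Hadoop classes are used.
// MapOnlySplitSketch, outputNameFor and split are hypothetical names.
final class MapOnlySplitSketch {

    // Derives the output name from the text before the first ':'.
    static String outputNameFor(String key) {
        int colon = key.indexOf(':');
        if (colon <= 0) {
            throw new IllegalArgumentException("Key has no prefix: " + key);
        }
        return "prefix" + key.substring(0, colon);
    }

    // Routes each record to the bucket for its prefix, one record at a time,
    // without sorting or grouping the input first.
    static Map<String, List<String>> split(List<String> records) {
        Map<String, List<String>> outputs = new LinkedHashMap<>();
        for (String record : records) {
            outputs.computeIfAbsent(outputNameFor(record), name -> new ArrayList<>())
                   .add(record);
        }
        return outputs;
    }
}
```

Each record is routed on its own, which is the property that lets the real job run without a reduce phase.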

Now if you have lots of input files and don't want twice as many output files as input files, the partitioner-based approach lets you consolidate the input into 2 outputs. They won't be nicely named (naming is another benefit of MultipleOutputs), but you can easily fix this by using MultipleOutputs in your reducer together with LazyOutputFormat, which ensures that the empty part-r files are not written as output.

So to answer - it depends on how many input files you have, and how many output files you want.

Chris White

Hi Chris, thanks for sharing your thoughts. I actually tried to simplify the problem: I do have to use the reduce phase. To me, a custom partitioner and MultipleOutputs inside the reducer accomplish the same thing, but which one offers better performance? Thanks. – JohnRossy Dec 02 '13 at 04:49
If you have to use a reducer, then I would say the difference in performance is probably negligible. – Chris White Dec 02 '13 at 11:28

score 0 · Answer 2 · answered Aug 07 '14 at 18:17

When you say the first option is better, that means you are bound to 2 values. If another key prefix turns up, you might need to change your partitioner or the configuration to set 3 reducers, so the better idea is to use MultipleOutputs.
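
To make the "bound to 2 values" point concrete, here is a small plain-Java sketch. It assumes numeric prefixes and a hypothetical third prefix 2:, which does not appear in the question; ModuloPartitionSketch and partitionFor are made-up names, and the modulo mapping is just one possible way to write such a partitioner.

```java
// Plain-Java illustration; ModuloPartitionSketch and partitionFor are hypothetical names.
final class ModuloPartitionSketch {

    // Maps the numeric prefix before ':' onto one of numPartitions reducers.
    static int partitionFor(String key, int numPartitions) {
        int colon = key.indexOf(':');
        if (colon <= 0) {
            throw new IllegalArgumentException("Key has no prefix: " + key);
        }
        int prefix = Integer.parseInt(key.substring(0, colon));
        return Math.floorMod(prefix, numPartitions);
    }
}
```

With 2 reducers, a 2: key lands in the same partition as the 0: keys, so the two groups end up in one file. Only after the reducer count is raised to 3 does each prefix get its own partition.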

Arun Poreddy