
When I was taught about MapReduce, one of the key components was the combiner. It is a step between the mapper and the reducer which essentially runs the reducer at the end of the map phase in order to decrease the number of lines of data the mapper outputs. As the size of the data I need to process increases (at the multi-terabyte scale), the reduce step becomes prohibitively slow. I talked to a friend of mine and he says that this has been his experience too, and that instead of using a combiner, he partitions his reduce key using a hash function, which reduces the number of values that go to each key in the reduce step. I tried this and it worked. Has anyone else had this experience with the combiner step not scaling well, and why can't I find any documentation of this problem or of the workaround? I'd rather not use a workaround if there is a way to make the combiner step scale.

[EDIT] Here is an example of the workaround my friend suggested which works much faster than a combiner:

Instead of outputting `word, count`

The mapper outputs `(word, hash(timestamp) % 1024), count`

Then there are two reduce steps to merge the mapper's output: the first sums the counts within each `(word, partition)` key, and the second sums those partial counts per word.
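
Roughly, in plain Python rather than actual Hadoop code (the sample input and the 1024-way salt are just for illustration), the two stages look like this:

```python
# Sketch of the salted-key approach: salt the word with hash(timestamp) % 1024,
# sum within each (word, salt) partition, then sum across partitions per word.
from collections import defaultdict

NUM_SALTS = 1024

def map_phase(records):
    """Emit ((word, salt), 1) instead of (word, 1)."""
    for timestamp, line in records:
        salt = hash(timestamp) % NUM_SALTS
        for word in line.split():
            yield (word, salt), 1

def reduce_sum(pairs):
    """Generic sum-by-key reduce, reused for both reduce stages."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return totals.items()

# First reduce: sum counts within each (word, salt) partition.
stage1 = reduce_sum(map_phase([(1409700000, "to be or not to be")]))

# Second reduce: strip the salt and sum the partial counts per word.
stage2 = reduce_sum((word, count) for (word, _salt), count in stage1)

print(dict(stage2))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```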

Narek

1 Answer


I think these are two distinct things, and what you describe is not really a workaround for the combiner.

  • A combiner is only applicable if the computation in question allows for partial reduction, that is, emitting an intermediate result from a subset of the tuples for a key and using the reduce step to combine those intermediate results. Summing or averaging values works easily with a combiner; other algorithms do not (see the sketch after this list).

  • Performance (scalability) of the Reduce step depends greatly on how well the key partition function maps unique map output keys to reducer slots. A good partitioner should assign each reducer worker roughly the same workload.
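
To illustrate what I mean by partial reduction, here is a rough sketch in plain Python (the word-count data is made up); a sum can be combined locally because adding partial sums gives the same result as adding all the values at once:

```python
# Sketch of partial reduction: the combiner runs the same sum logic on each
# mapper's local output, and the reducer merges the partial sums.
from collections import defaultdict

def combine(map_output):
    """Run the reduce logic on one mapper's local output."""
    partial = defaultdict(int)
    for word, count in map_output:
        partial[word] += count
    return list(partial.items())

def reduce_final(combined_outputs):
    """Merge the partial sums emitted by each combiner."""
    totals = defaultdict(int)
    for partials in combined_outputs:
        for word, count in partials:
            totals[word] += count
    return dict(totals)

mapper_a = [("be", 1), ("to", 1), ("be", 1)]
mapper_b = [("to", 1), ("or", 1)]
print(reduce_final([combine(mapper_a), combine(mapper_b)]))
# {'be': 2, 'to': 2, 'or': 1}
```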

As the size of the data I need to process increases (at the multi-terabyte scale), the reduce step becomes prohibitively slow

There is nothing inherent about the MR paradigm that makes the Reduce step not scale. Your algorithm may not map well to MapReduce, though. If you can provide more information on what specifically you are doing, we can help figure it out.

he partitions his reduce key using a hash function which reduces the number of values that go to each key in the reduce step

This does not make sense to me. Using a hash function on the keys can only increase the number of values that go to each bucket.

Alexander Gessler
  • My computation does allow for partial computation, and the workaround I described does a partial computation, just as a separate reduce step rather than in the combine step. I want to know why when I do this as a combiner, it is so slow, but when I partition the reduce keys, it seems to work fine. – Narek Sep 03 '14 at 05:06
  • Here is an example: instead of outputting `word, count` the mapper outputs `(word, hash(timestamp) % 1024), count` and then there is an extra reduce step to add up the words within a partition. I was skeptical of this approach when I first saw it, but somehow it is MUCH faster than using a combiner. – Narek Sep 03 '14 at 05:07
  • And afterwards you aggregate the partial results coming from the reducers again? – Alexander Gessler Sep 03 '14 at 05:23
  • Yes, so a word count example would have two reducers. – Narek Sep 03 '14 at 05:25
  • Maybe you can merge this info into the question. I do not see why this would be faster, though I am curious to find out. – Alexander Gessler Sep 03 '14 at 05:33