When I was taught about MapReduce, one of the key components was the combiner. It is a step between the mapper and the reducer that essentially runs the reducer logic at the end of the map phase in order to cut down the number of records the mapper emits. As the size of the data I need to process grows (into the multi-terabyte range), the reduce step becomes prohibitively slow.

I talked to a friend of mine and he says this has been his experience too. Instead of using a combiner, he salts his reduce key with a hash, which reduces the number of values that land on any single key in the reduce step. I tried this and it worked. Has anyone else seen the combiner step fail to scale like this, and why can't I find any documentation of the problem or the workaround? I'd rather not use a workaround if there is a way to make the combiner step scale.
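For anyone not familiar with the setup, here is a minimal plain-Python word-count sketch (not my actual job; the data and function names are just placeholders) showing what I mean by the combiner: it applies the same aggregation as the reducer to each map task's output before the shuffle.

```python
from collections import Counter
from itertools import chain

# Toy input: one list of words per map task.
map_inputs = [
    ["apple", "banana", "apple"],
    ["banana", "apple", "banana"],
]

def mapper(words):
    # Emit (word, 1) for every word.
    return [(w, 1) for w in words]

def combiner(pairs):
    # Same aggregation as the reducer, but applied to a single map
    # task's output before the shuffle, so fewer records cross the wire.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

def reducer(pairs):
    # Final aggregation across all map tasks.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

shuffled = chain.from_iterable(combiner(mapper(chunk)) for chunk in map_inputs)
print(reducer(shuffled))  # {'apple': 3, 'banana': 3}
```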
[EDIT] Here is an example of the workaround my friend suggested, which runs much faster for me than a combiner:
Instead of outputting (word, count), the mapper outputs ((word, hash(timestamp) % 1024), count). Two reduce passes then merge the mapper output: the first aggregates each salted key, and the second drops the salt and combines the partial counts per word.
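Here is a rough, self-contained Python sketch of that salting idea. The names (NUM_BUCKETS, first_reduce, second_reduce) are made up for illustration and the real thing runs as two MapReduce jobs, but the point is the same: the salt spreads each word's values over up to 1024 reduce keys, and a second pass merges the partial counts.

```python
import time
from collections import defaultdict

NUM_BUCKETS = 1024  # assumed bucket count, matching the example above

def mapper(words):
    # Salt the key so one hot word is spread over many reduce keys.
    # Here the salt comes from the current timestamp, as in the example;
    # a random bucket per record would work the same way.
    salt = hash(time.time()) % NUM_BUCKETS
    for w in words:
        yield (w, salt), 1

def first_reduce(salted_pairs):
    # First pass: aggregate per (word, salt) key, so each reduce key
    # only sees roughly 1/NUM_BUCKETS of the values for a given word.
    partials = defaultdict(int)
    for (word, salt), n in salted_pairs:
        partials[(word, salt)] += n
    for (word, _salt), total in partials.items():
        yield word, total

def second_reduce(pairs):
    # Second pass: drop the salt and merge partial counts into totals.
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] += n
    return dict(totals)

# Toy run with two "map tasks".
chunks = [["apple", "banana", "apple"], ["banana", "apple", "banana"]]
partials = [kv for chunk in chunks for kv in first_reduce(mapper(chunk))]
print(second_reduce(partials))  # {'apple': 3, 'banana': 3}
```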