What are the use cases where combiners are apt, and what are the use cases where they are not? I am aware of the functionality of a combiner, but I am trying to understand the different use cases where combiners make sense.
Have a look at this question: http://stackoverflow.com/questions/33406566/combiner-inplementation-and-internal-working/33408776#33408776 – Ravindra babu Nov 02 '15 at 09:49
3 Answers
Source: Hadoop: The Definitive Guide:
Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.
If there are only one or two spills, the potential reduction in map output size is not worth the overhead in invoking the combiner, so it is not run again for this map output.
What is a spill? Each map task has a circular memory buffer that it writes output to. When the contents of the buffer reach a certain threshold (80% by default), a background thread starts spilling the contents to disk.
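A small sketch, not from the book: the spill machinery described above is tunable through standard Hadoop 2.x configuration properties. The values shown are the defaults as I understand them; treat the exact property names as an assumption to verify against your Hadoop version.

```java
import org.apache.hadoop.conf.Configuration;

public class SpillTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 100);            // circular buffer size in MB
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f); // start spilling at 80% full
        conf.setInt("mapreduce.map.combine.minspills", 3);        // re-run combiner on merge only if >= 3 spills
        System.out.println("buffer = " + conf.get("mapreduce.task.io.sort.mb") + " MB");
    }
}
```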
In my opinion, you should always run the combiner when your function fits the criteria (commutative and associative). The Hadoop framework decides whether to actually run the combiner (based on map output size / number of spills), so you don't have to worry about it degrading performance. Wiring one up is a one-line change in the job driver, sketched below.
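A minimal driver sketch, assuming the TokenizerMapper and IntSumReducer classes from Hadoop's stock word-count example are available on the classpath. The key point is that the summing reducer is commutative and associative, so the same class can be registered as the combiner:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);   // assumed: the word-count example's mapper
        job.setCombinerClass(IntSumReducer.class);   // same class reused for per-map aggregation
        job.setReducerClass(IntSumReducer.class);    // assumed: the word-count example's summing reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that setting the combiner is only a hint: as quoted above, the framework may skip it when there are too few spills for it to pay off.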

The normal map output for the word-count example, while processing the file below, is:
file1:
this is a book
this is a bookshelf
Map output:
this 1
is 1
a 1
book 1
this 1
is 1
a 1
bookshelf 1
Now, in order to avoid such a large data transfer over the network, a combiner is used. For word count it is simply the reducer code run on the map side, so with the combiner registered the map output becomes:
this 2
is 2
a 2
book 1
bookshelf 1
So there is less data to transfer over the network to the reducer node.
About decreasing performance: in the example above, if the file has a very large number of lines, the combiner is useful for avoiding a large data transfer; but if the file has only two lines in total, the combiner merely adds its own execution overhead. A sketch of the combiner class is shown below.
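A minimal sketch of the word-count combiner described above; as the answer says, it is just the usual summing reducer logic applied on the map side:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the partial counts for each word on the map side, so only one
// (word, partialCount) pair per word leaves the node.
public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();     // add up the 1s emitted by the mapper
        }
        result.set(sum);
        context.write(key, result); // e.g. ("this", 2) instead of ("this", 1), ("this", 1)
    }
}
```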

Combiners are primarily used to decrease the amount of data that needs to be processed by reducers; they are often called mini-reducers.
One use case to explain this better:
Output from the Mappers, which is the input to the Reducer in the absence of a Combiner:
<What,1> <do,1> <you,1> <mean,1> <by,1> <Object,1>
<What,1> <do,1> <you,1> <know,1> <about,1> <Java,1>
<What,1> <is,1> <Java,1> <Virtual,1> <Machine,1>
<How,1> <Java,1> <enabled,1> <High,1> <Performance,1>
Output from Mapper -> Combiner, which is the input to the Reducer when a Combiner is used:
<What,3> <do,2> <you,2> <mean,1> <by,1> <Object,1>
<know,1> <about,1> <Java,3>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>
You can clearly see the reduction in data transfer from using a combiner, even in this small example. Imagine a scenario with millions of words and terabytes of data, and you can see the huge network bandwidth savings.
When to Use Combiner?
You can use a Combiner for the word-count example.
Combiners can only be used for functions that are commutative (a.b = b.a) and associative (a.(b.c) = (a.b).c).
When should you not use a Combiner?
Simple: when the above condition does not hold. E.g., replace word count with calculating the mean (average) age from a list of employees. Passing all the values from the Mappers to the Reducer gives the correct mean; but if each Mapper sends only the average of its own subset, the Reducer's average of those averages will in general be a different (wrong) value, as the toy example below shows.
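A toy illustration in plain Java (not MapReduce, and with made-up ages) of why averaging partial averages goes wrong, and the standard workaround:

```java
public class MeanDemo {
    public static void main(String[] args) {
        int[] mapper1 = {20, 30, 40};   // hypothetical ages seen by mapper 1
        int[] mapper2 = {50, 60};       // hypothetical ages seen by mapper 2

        double correct = (20 + 30 + 40 + 50 + 60) / 5.0;          // 200 / 5 = 40.0
        double ofAverages = (avg(mapper1) + avg(mapper2)) / 2.0;  // (30 + 55) / 2 = 42.5

        System.out.println("true mean = " + correct + ", mean of means = " + ofAverages);
        // Fix: have each mapper/combiner emit (sum, count) pairs instead of
        // an average; sums and counts ARE commutative and associative, so
        // the reducer can divide total sum by total count at the very end.
    }

    static double avg(int[] xs) {
        int sum = 0;
        for (int x : xs) sum += x;
        return (double) sum / xs.length;
    }
}
```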
The differences between a Combiner and a Reducer can be checked here, and when not to use a combiner can be checked here.
