
I ran the WordCount example with a combiner. Here are the job counters from the run:

13/10/07 22:32:38 INFO mapred.JobClient:     Map input records=20111076
13/10/07 22:32:38 INFO mapred.JobClient:     Reduce shuffle bytes=467280
13/10/07 22:32:38 INFO mapred.JobClient:     Spilled Records=541137
13/10/07 22:32:38 INFO mapred.JobClient:     Map output bytes=632287974
13/10/07 22:32:38 INFO mapred.JobClient:     Total committed heap usage (bytes)=4605870080
13/10/07 22:32:38 INFO mapred.JobClient:     Combine input records=62004735
13/10/07 22:32:38 INFO mapred.JobClient:     SPLIT_RAW_BYTES=2280
13/10/07 22:32:38 INFO mapred.JobClient:     Reduce input records=32020
13/10/07 22:32:38 INFO mapred.JobClient:     Reduce input groups=1601
13/10/07 22:32:38 INFO mapred.JobClient:     Combine output records=414658
13/10/07 22:32:38 INFO mapred.JobClient:     Reduce output records=1601
13/10/07 22:32:38 INFO mapred.JobClient:     Map output records=61622097
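
For reference, a minimal driver sketch of the setup, assuming the standard TokenizerMapper and IntSumReducer from the Hadoop examples (class names are illustrative, not necessarily my exact code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // Reuse the reducer as the combiner: word counting is an
        // associative, commutative sum, so partial aggregation is safe.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}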

I have two questions. Why is Map output records less than Combine input records? And why is Reduce input records much less than Combine output records? I know the combiner may be called several times, but shouldn't the figure for Combine output records be the result of the last call? Why isn't it equal to the number of records the reducers consume?

Thanks for any help!


1 Answer


The combiner is not always called; you have no control over whether it runs or how many times, as that is for the framework to decide. It may run over the output of each map-side spill, and again over already-combined records when spill files are merged, which is why Combine input records can exceed Map output records, and why Combine output records accumulates across every invocation rather than reporting only the last pass. This explains the numbers. It seems the combiner did a great job though:

Map output records=61622097    ->  Reduce input records=32020
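
This multiple-invocation behaviour is also why a combiner must compute an associative, commutative partial aggregate. A minimal sketch in the shape of the IntSumReducer from the Hadoop examples (illustrative, not your exact code):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Safe as a combiner: the framework may run it zero, one, or many times,
// on raw map output or on records a previous combiner pass already emitted,
// and re-summing partial counts still yields the same final total.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();  // accumulate partial counts for this key
        }
        result.set(sum);
        context.write(key, result);
    }
}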