Difference between launched reduce tasks and the number of times the reduce function is called?

Question

I have just started learning Hadoop and am running a MapReduce program with a custom partitioner and comparator (on a single-node environment first; I will deploy it on a cluster later). I am seeing behavior I don't understand: according to my partitioner and comparator, the reduce method is called five times, which I cross-checked in the logs. However, on the console the count for Launched reduce tasks is still '1'. Do these five calls run in parallel? If not, how can I get the benefit of distributed computing for these reduce calls, given that the data each one collects will be large? Please clarify which concept I am missing.

score 4 · Answer 1 · answered Sep 26 '14 at 17:19

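The reduce function is the function Hadoop calls once for each distinct key (as grouped by your comparator), passing it all the values collected for that key. A reduce task is a process running on a machine that executes the reduce function many times, serially, once for each key group in its partition. Five reduce calls inside one launched reduce task is therefore expected, and those calls do not run in parallel.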

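If you want your data to actually be processed in parallel, you have to ask for more than one reduce task yourself, for example with job.setNumReduceTasks(n) in the driver, or with -D mapreduce.job.reduces=n on the command line if the driver goes through ToolRunner (the default is 1). Hadoop will then use the partitioner to divide the keys between the tasks, and each task still calls reduce serially for its own keys.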
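To make the distinction concrete, here is a minimal plain-Java sketch that models it without using any Hadoop classes. Everything in it (the class ReduceTaskSketch, the methods reduceCallsPerTask, partitionFor and reduce, and the sample keys) is hypothetical and exists only for illustration. The partitioning rule is an assumed hash-based one, similar in spirit to a default hash partitioner; a custom partitioner would replace it. The simulated tasks run one after another in a loop here, whereas real reduce tasks are separate processes that can run in parallel.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of "reduce tasks" vs. "reduce() calls". No Hadoop classes are used.
class ReduceTaskSketch {

    // Returns how many times reduce() was invoked by each simulated reduce task.
    static int[] reduceCallsPerTask(Map<String, List<Integer>> grouped, int numReduceTasks) {
        // Partition step: every key is assigned to exactly one task.
        List<Map<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            partitions.add(new TreeMap<>());
        }
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int task = partitionFor(e.getKey(), numReduceTasks);
            partitions.get(task).put(e.getKey(), e.getValue());
        }

        // Each task walks its own keys in sorted order and calls reduce() once per key.
        // In a real cluster the tasks could run in parallel; here they run in a loop.
        int[] calls = new int[numReduceTasks];
        for (int task = 0; task < numReduceTasks; task++) {
            for (Map.Entry<String, List<Integer>> e : partitions.get(task).entrySet()) {
                reduce(e.getKey(), e.getValue());
                calls[task]++;
            }
        }
        return calls;
    }

    // Assumed hash-based partitioning rule (illustrative only).
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Stand-in for a user-defined reduce function: sums the values of one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String key : new String[] {"a", "b", "c", "d", "e"}) {
            grouped.put(key, Arrays.asList(1, 2, 3));
        }
        // One task: all five reduce() calls happen serially inside it.
        System.out.println("1 task:  " + Arrays.toString(reduceCallsPerTask(grouped, 1)));
        // Three tasks: the same five calls are spread across the tasks.
        System.out.println("3 tasks: " + Arrays.toString(reduceCallsPerTask(grouped, 3)));
    }
}
```

With one task, the single counter accounts for all five calls, which matches the situation in the question. With more tasks the total number of reduce calls stays the same; only their distribution across tasks changes, and that distribution depends on the partitioner.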
