
I have just started learning Hadoop and am running a MapReduce program with a custom partitioner and comparator (on a single-node environment first; I will deploy it on a cluster later). The behavior I don't understand is this: according to my partitioner and comparator, the reduce method is called five times, and I cross-checked this in the logs. However, on the console the count for "Launched reduce tasks" is still 1. Are these five reduce calls running in parallel or not? If not, how do I get the benefit of distributed computing for these reduce calls, given that the data each call receives will be large? Please clarify which concept I am missing.

Bruce_Wayne

1 Answer


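The reduce function is the function the framework calls once for each distinct key (or each group of keys, when a custom grouping comparator is used), passing it all the values for that key. A reduce task is a process running on a machine that executes the reduce function many times, serially. That is why you see five reduce calls but only one launched reduce task.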

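If you want your data to actually be processed in parallel, you have to explicitly configure the job to launch multiple reduce tasks, for example by calling job.setNumReduceTasks(n) in the driver. Hadoop will then use the partitioner to divide the keys between those tasks; the default is a single reduce task, which matches the count you see on the console.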
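To make the difference between reduce calls and reduce tasks concrete, here is a minimal sketch in plain Java. It does not use the Hadoop API; it only simulates how keys would be spread over reduce tasks, assuming the same formula as Hadoop's default HashPartitioner. The class name ReduceTaskSketch, the helper methods, and the five sample keys are hypothetical, chosen only for illustration; a custom partitioner like the one in the question would distribute keys differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReduceTaskSketch {
    // Same formula as Hadoop's default HashPartitioner:
    // mask off the sign bit, then take the remainder by the task count.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Group the keys by the reduce task that would receive them.
    // Each task then calls reduce() once per key, one key after another.
    static Map<Integer, List<String>> assign(List<String> keys, int numReduceTasks) {
        Map<Integer, List<String>> byTask = new TreeMap<>();
        for (String key : keys) {
            int task = partitionFor(key, numReduceTasks);
            byTask.computeIfAbsent(task, t -> new ArrayList<>()).add(key);
        }
        return byTask;
    }

    public static void main(String[] args) {
        // Five hypothetical keys, i.e. five reduce() calls in total.
        List<String> keys = Arrays.asList("a", "b", "c", "d", "e");
        for (int n : new int[] {1, 3}) {
            System.out.println(n + " reduce task(s):");
            assign(keys, n).forEach((task, ks) ->
                System.out.println("  task " + task + " calls reduce() serially for keys " + ks));
        }
    }
}
```

With one reduce task, all five keys land in task 0, so the five reduce calls run one after another in a single process. With three tasks the same five calls are split across the tasks, and only the tasks themselves can run in parallel.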

loopbackbee