I have just started learning Hadoop and am running a MapReduce program with a custom partitioner and comparator (on a single-node environment first; I will deploy it on a cluster later). The behavior I don't understand is this: according to my partitioner and comparator, the reduce method is called five times, which I cross-checked in the logs. However, on the console the count for "Launched reduce tasks" is still 1. Are these five function calls running in parallel or not? If not, how do I get the benefit of distributed computing for these reduce calls, given that the data collected by each of them will be large? Please clarify which concept I am missing.
1 Answer
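The reduce function is the function Hadoop calls once per key group, passing it the key together with all of the values collected for that key. A reduce task is a program running on a machine that executes the reduce function multiple times, serially, once for each key group assigned to it. That is why you see five reduce calls but only one launched reduce task.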
If you want your data to actually be processed in parallel, you have to configure the job to use multiple reduce tasks yourself (the default is one). Hadoop will then divide the keys between those tasks according to the partitioner.
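To make the difference concrete, here is a minimal sketch in plain Java (no Hadoop dependency) that simulates how keys, and therefore reduce calls, get spread over reduce tasks. It assumes five hypothetical keys `a` to `e` and uses the same formula as Hadoop's default `HashPartitioner`, applied to `String.hashCode()` as a stand-in; your custom partitioner and key type will give a different distribution. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

class ReducePartitionSketch {

    // Same formula as Hadoop's default HashPartitioner:
    // clear the sign bit, then take the remainder by the number of reduce tasks.
    // Assumption: String.hashCode() stands in for the real key's hashCode().
    public static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // For each simulated reduce task, returns the sorted keys for which
    // that task would call reduce(), one key after another.
    public static List<List<String>> assignKeys(Collection<String> keys, int numReduceTasks) {
        List<TreeSet<String>> perTask = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            perTask.add(new TreeSet<>());
        }
        for (String key : keys) {
            perTask.get(partitionFor(key, numReduceTasks)).add(key);
        }
        List<List<String>> result = new ArrayList<>();
        for (TreeSet<String> taskKeys : perTask) {
            result.add(new ArrayList<>(taskKeys));
        }
        return result;
    }

    public static void main(String[] args) {
        // Five hypothetical keys, so five reduce() calls in total.
        List<String> keys = Arrays.asList("a", "b", "c", "d", "e");
        for (int numTasks : new int[] {1, 3}) {
            System.out.println(numTasks + " reduce task(s):");
            List<List<String>> tasks = assignKeys(keys, numTasks);
            for (int t = 0; t < tasks.size(); t++) {
                System.out.println("  task " + t + " calls reduce() serially for keys " + tasks.get(t));
            }
        }
    }
}
```

With one task, all five reduce calls run one after another in the same task. With three tasks, the calls are split into three groups that can run on different machines, while the calls inside each group still run serially. In a real job the number of reduce tasks is set with `job.setNumReduceTasks(3)` on `org.apache.hadoop.mapreduce.Job`, or with `-D mapreduce.job.reduces=3` if the driver parses generic options (for example via `ToolRunner`).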

loopbackbee