
I have to generate n*(n-1)/2 candidate pairs, from a list of n candidates.

This can be done in every mapper instance or in every reducer instance.

But I observed that this operation was much faster when done in the reduce phase than when done in the map phase. What is the reason?

Can Mappers not support heavy computation?

What is the impact of a Mapper instance doing such a computation on the network?

Thanks!

1 Answer


The short answer is: when the mapper generates the data, Hadoop has to copy that data from the mapper to the reducer, and this costs too much time.

Total size of the result data

The total data generated is O(n^2).
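
To make the O(n^2) growth concrete, here is a minimal plain-Java sketch of the pair generation itself. It is not Hadoop code, and the class and method names (PairGenerator, allPairs, pairCount) are made up for illustration; the same nested loop could sit inside either a map() or a reduce() method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop: enumerates candidate pairs.
class PairGenerator {

    /** Returns every unordered pair (i < j) from the candidate list. */
    static List<String[]> allPairs(List<String> candidates) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < candidates.size(); i++) {
            for (int j = i + 1; j < candidates.size(); j++) {
                pairs.add(new String[] {candidates.get(i), candidates.get(j)});
            }
        }
        return pairs;
    }

    /** Closed-form count of unordered pairs: n*(n-1)/2. */
    static long pairCount(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("a", "b", "c", "d");
        // The enumerated size and the closed form should agree.
        System.out.println(allPairs(candidates).size() + " pairs, formula gives "
                + pairCount(candidates.size()));
    }
}
```

The input is only n items, but the output grows quadratically, so whichever phase emits the pairs is the phase whose output dominates the job's I/O.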

Comparison of data generation by mapper vs. reducer

If you generate the n*(n-1)/2 pairs in the mapper, the intermediate data has to be copied to the reducers. In Hadoop this step is called the shuffle phase, and the reducers still need to write the data to HDFS. In your case, the total data read from and written to hard disk during the shuffle phase can be as much as 6 * sizeof(intermediate data), which is very large.

If the data is generated by the reducer instead, the O(n^2) intermediate data never has to be transferred, so performance can be better.
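
A rough back-of-envelope sketch of the difference, in plain Java. The record sizes are assumptions you would replace with your own numbers, the factor of 6 is simply the upper bound quoted above, and the class and method names are hypothetical.

```java
// Hypothetical estimator, not a Hadoop API: compares how much data
// crosses the shuffle under the two strategies.
class ShuffleCostEstimate {

    /** Intermediate bytes if mappers emit all n*(n-1)/2 pairs. */
    static long intermediateBytes(long n, long bytesPerPair) {
        return n * (n - 1) / 2 * bytesPerPair;
    }

    /** Disk traffic during shuffle, using the factor of 6 quoted above as an upper bound. */
    static long shuffleDiskBytes(long n, long bytesPerPair) {
        return 6 * intermediateBytes(n, bytesPerPair);
    }

    /** Bytes shuffled if only the n candidates are sent and reducers build the pairs. */
    static long candidateBytes(long n, long bytesPerCandidate) {
        return n * bytesPerCandidate;
    }

    public static void main(String[] args) {
        long n = 100_000;            // assumed number of candidates
        long bytesPerPair = 32;      // assumed serialized size of one pair
        long bytesPerCandidate = 16; // assumed serialized size of one candidate
        System.out.println("pairs generated in mapper, shuffle disk bytes: "
                + shuffleDiskBytes(n, bytesPerPair));
        System.out.println("pairs generated in reducer, shuffled bytes:    "
                + candidateBytes(n, bytesPerCandidate));
    }
}
```

The first figure grows with n^2 and the second with n, which is why the gap widens quickly as the candidate list gets longer.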

So your performance issue is mainly caused by data transfer, not computation. Without the disk access, the mapper and the reducer would perform about the same.

Ways to improve performance of the mapper data generation strategy

If you still want to use the mapper to generate the data, tuning io.sort.factor and turning on compression may help improve performance.
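
As a sketch only, these settings could be applied through the job's Configuration object roughly like this. It needs the Hadoop jars on the classpath. I am assuming that "compression" here means compressing the map output, which in Hadoop 1.x is the mapred.compress.map.output property; newer releases renamed these properties to mapreduce.task.io.sort.factor and mapreduce.map.output.compress. The class name and the value 50 are only illustrative, so check the defaults for your version.

```java
import org.apache.hadoop.conf.Configuration;

// Hypothetical helper class; only the property names come from Hadoop.
class ShuffleTuning {

    static Configuration tunedConfiguration() {
        Configuration conf = new Configuration();
        // Number of spill streams merged at once while sorting map output.
        // 50 is an illustrative value, not a recommendation.
        conf.setInt("io.sort.factor", 50);
        // Compress intermediate map output before it is shuffled
        // (Hadoop 1.x property name; an assumption about what is meant above).
        conf.setBoolean("mapred.compress.map.output", true);
        return conf;
    }
}
```

Compression trades CPU for less disk and network traffic, so it tends to help most when the shuffle is the bottleneck, as described above.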

Kun Ling
  • Thanks for the reply! Mine is a critical case where this computation can be done either in the reduce phase of the previous stage or in the mapper of the current stage, and these candidate pairs are needed by the reducer of the current stage. So if I do this in the mapper of the current stage, I avoid writing the pairs to HDFS one more time. In any case, these pairs will have to pass through the shuffle. I even set the number of reducers to zero in order to test the performance of the map, as the mapper output is then written directly to HDFS with no shuffle involved. – Mahalakshmi Lakshminarayanan Jun 05 '13 at 13:21
  • Even so, I observed that map still performed much slower than reduce. I don't know why! – Mahalakshmi Lakshminarayanan Jun 05 '13 at 13:27
  • What's the size of your cluster? – SSaikia_JtheRocker Jun 06 '13 at 12:07