The short answer is: when you use the mapper to generate the data, Hadoop has to copy that data from the mappers to the reducers, and this copy costs too much time.
Resulting total data size
The total data generated is O(n^2).
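To get a feel for the scale (assumed numbers, purely for illustration): with n = 100,000 input records, n*(n-1)/2 is roughly 5 * 10^9 pairs, and at even 20 bytes per pair that is about 100 GB of intermediate data.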
Comparison of data generation by mapper vs. reducer
If you generate the n*(n-1)/2 pairs in the mapper, the intermediate data has to be copied to the reducers. This step in Hadoop is called the shuffle phase, and the reducer will still need to write the data to HDFS. The total data read from and written to disk in your case during the shuffle can be around 6 * sizeof(intermediate data), which is very large.
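As a rough illustration, here is a minimal sketch of one possible shape of the mapper-side approach (hypothetical class name and key/value types, assuming the split's items fit in memory per map task); every pair it emits becomes intermediate data the shuffle must move:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: buffer the split's items, then emit every pair.
// All O(n^2) pairs become shuffle traffic to the reducers.
public class PairGeneratingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final List<String> items = new ArrayList<>();

    @Override
    protected void map(LongWritable key, Text value, Context ctx) {
        items.add(value.toString());
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        // n*(n-1)/2 pairs, all of which must be sorted, spilled, and copied.
        for (int i = 0; i < items.size(); i++)
            for (int j = i + 1; j < items.size(); j++)
                ctx.write(new Text(items.get(i)), new Text(items.get(j)));
    }
}
```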
If instead the pairs are generated by the reducer, transferring the O(n^2) intermediate data is unnecessary: only the original n items cross the shuffle, so this approach performs much better.
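Here is the corresponding reducer-side sketch (again with hypothetical class name and types): the mappers forward each item once, and the pairs are only materialized after the shuffle, directly into the HDFS output:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical sketch: each item crosses the shuffle once (O(n)), and the
// reducer expands its group into pairs while writing the final output.
public class PairGeneratingReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        List<String> items = new ArrayList<>();
        for (Text v : values) {
            items.add(v.toString()); // Hadoop reuses Text objects, so copy the string
        }
        for (int i = 0; i < items.size(); i++)
            for (int j = i + 1; j < items.size(); j++)
                ctx.write(new Text(items.get(i)), new Text(items.get(j)));
    }
}
```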
So your performance issue is mainly caused by data transfer, not computation. If there were no disk or network access involved, the mapper and reducer approaches would perform the same.
Ways to improve the performance of the mapper data-generation strategy
If you still want to generate the data in the mapper, increasing io.sort.factor and turning on intermediate compression may help improve the performance.
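For example, a minimal driver-side sketch (property names are for Hadoop 2.x; on older versions these were io.sort.factor and mapred.compress.map.output, and Snappy availability depends on your cluster):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

public class PairJobDriver {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Merge more spill files at once during the map-side sort.
        conf.setInt("mapreduce.task.io.sort.factor", 100); // "io.sort.factor" on Hadoop 1.x
        // Compress map output to shrink the data crossing the shuffle.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);
        // ... set up and submit the Job with this conf as usual ...
    }
}
```

Compression trades some CPU for much less shuffle I/O, which is usually a win when the intermediate data is this large.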