3

The MapReduce programming model consists of two procedures, map and reduce. Why do we need the map part, when we can simply do the mapping inside the reduce function?

Consider the following pseudocode:

result = my_list.map(my_mapper).reduce(my_reducer);

This could be shortened to

result = my_list.reduce(lambda x : my_reducer(my_mapper(x)));

How can the 1st approach be preferable to the 2nd one, when the 1st approach requires one more pass through the data? Is my code example oversimplifying?

Fermat's Little Student

2 Answers

4

Well, if you are referring to Hadoop-style MapReduce, it is actually map-shuffle-reduce, where the shuffle is the reason map and reduce are separated. At a slightly higher level you can think about data locality. Each key-value pair passed through map can generate zero or more key-value pairs. To be able to reduce these, you have to ensure that all values for a given key are available to a single reducer, hence the shuffle. What is important is that pairs emitted from a single input pair can be processed by different reducers.
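
For illustration, here is a minimal word-count sketch of the map-shuffle-reduce flow, with plain Scala collections standing in for a cluster (all names here are hypothetical):

    // map-shuffle-reduce sketch: collections stand in for a cluster
    object MapShuffleReduce {
      // map: one input line can emit zero or more (word, 1) pairs
      def mapper(line: String): Seq[(String, Int)] =
        line.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

      // reduce: combine all values that the shuffle grouped under one key
      def reducer(counts: Seq[Int]): Int = counts.sum

      def main(args: Array[String]): Unit = {
        val input = Seq("a b a", "b c")

        val mapped = input.flatMap(mapper)              // map phase
        // shuffle phase: group pairs by key; note that "a" and "b",
        // emitted from the same input line, land in different groups
        // and could be handled by different reducers
        val shuffled = mapped
          .groupBy(_._1)
          .map { case (k, kvs) => (k, kvs.map(_._2)) }
        val reduced = shuffled.map { case (k, vs) => (k, reducer(vs)) }

        println(reduced.toSeq.sortBy(_._1))             // Seq((a,2), (b,2), (c,1))
      }
    }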

It is possible to use patterns like map-side aggregation or combiners, but at the end of the day it is still (map)-reduce-shuffle-reduce.

Assuming data locality is not an issue, higher-order functions like map and reduce also provide an elegant abstraction layer. Finally, it is a declarative API: a simple expression like xs.map(f1).reduce(f2) describes only what, not how. Depending on the language or context, it can be evaluated eagerly or lazily, operations can be fused, and in more complex scenarios reordered and optimized in many different ways.
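
As a sketch of that point, in Scala a lazy view fuses the map into the single reduction pass, so the "extra pass" from the question never materializes (plain Scala 2.13 collections; xs, f1, f2 are stand-ins):

    val xs = (1 to 5).toList

    // Strict: builds an intermediate list holding the map results.
    val strict = xs.map(_ * 2).reduce(_ + _)

    // Lazy view: the map is applied element by element inside the single
    // reduction pass, so no intermediate collection is ever created.
    val fused = xs.view.map(_ * 2).reduce(_ + _)

    assert(strict == fused) // both are 30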

Regarding your code: even if the signatures were correct, it wouldn't really reduce the number of times you pass over the data. Moreover, if you push map into the aggregation, the arguments passed to the aggregation function are no longer of the same type. That means you need either a sequential fold or much more complex merging logic.
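
A short sketch of the type issue, with hypothetical myMapper/myReducer (A is the input element type, B the mapped type):

    val myList: List[Int] = List(1, 2, 3)                 // elements of type A
    def myMapper(a: Int): String = a.toString             // A => B
    def myReducer(x: String, y: String): String = x + y   // (B, B) => B

    // reduce needs a (B, B) => B function, so mapping must happen first:
    val viaMap = myList.map(myMapper).reduce(myReducer)   // "123"

    // Pushing the mapper inside forces a (B, A) => B function, i.e. a
    // sequential fold that must consume elements left to right:
    val viaFold = myList.foldLeft("")((acc, a) => myReducer(acc, myMapper(a)))

    // To recover parallelism you need a separate (B, B) => B merge step:
    // fold each partition independently, then merge the partial results.
    val partials = myList.grouped(2).toList
      .map(_.foldLeft("")((acc, a) => myReducer(acc, myMapper(a))))
    val merged = partials.reduce(myReducer)               // also "123"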

zero323
1

At a high level, MapReduce is all about processing in parallel. Even though the reducers work on the map output, in practical terms each reducer gets only part of the data, and that is possible only in the first approach.

In your second approach, the reducer actually needs the entire output of the mapper, which defeats the idea of parallelism.
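
As a rough sketch of that point, here is a hypothetical two-reducer setup with plain Scala collections standing in for a cluster:

    val pairs = Seq(("a", 1), ("b", 1), ("a", 1), ("c", 1)) // map output
    val numReducers = 2

    // Route each key to a reducer; every value for a key lands in one place.
    val byReducer = pairs.groupBy { case (k, _) => math.abs(k.hashCode) % numReducers }

    // Each reducer sums only its own keys, independently and in parallel.
    val results = byReducer.values.flatMap { partition =>
      partition.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).sum) }
    }
    println(results.toSeq.sortBy(_._1)) // Seq((a,2), (b,1), (c,1))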

Ramzy
  • I mean, from your code, your reducer actually needs specific map outputs (in your case, the entire map output), and there is no way that different map outputs can go to different reducers to exploit parallelism – Ramzy Nov 03 '15 at 23:07