
I am confused by the following two conflicting notions about MapReduce, both arising from the same source:

Is it:

  • the reducer side fetches the entire output of the map (and combiner) phase, sorts it, and then applies the reduce function in one shot. I get this notion from:

However in MapReduce the reducer input data needs to be sorted, so the reduce() logic is applied after the shuffle-sort process. Since Spark does not require a sorted order for the reducer input data, we don't need to wait until all the data gets fetched to start processing.

or is it:

  • the reducer side fetches a pre-specified amount of map-combiner output and applies the combiner to it, then receives the next batch and applies the combiner to that batch, and so on. The results of all these combiner runs are then merged together, sorted, and fed to the reduce function for the final aggregation (both notions are sketched in code after this list). I get this notion from:

reduce side: Shuffle process in Hadoop will fetch the data until a certain amount, then applies combine() logic, then merge sort the data to feed the reduce() function.
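
To make the two readings concrete, here is a minimal, runnable Python simulation of a word count. This is my own sketch, not Hadoop code: the batch contents and the names `combine`, `reduce_fn`, `reducer_notion_1`, and `reducer_notion_2` are all illustrative assumptions, not Hadoop APIs.

```python
import heapq
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def combine(pairs):
    # Combiner for word count: locally sum counts per key.
    # Safe to apply repeatedly because addition is associative and commutative.
    acc = defaultdict(int)
    for key, value in pairs:
        acc[key] += value
    return sorted(acc.items())  # each combined batch becomes a sorted "run"

def reduce_fn(key, values):
    return key, sum(values)

def reducer_notion_1(fetched):
    # Notion 1: fetch everything first, sort once, then reduce in one shot.
    fetched = sorted(fetched, key=itemgetter(0))
    return [reduce_fn(k, [v for _, v in grp])
            for k, grp in groupby(fetched, key=itemgetter(0))]

def reducer_notion_2(batches):
    # Notion 2: combine each fetched batch as it arrives, then merge-sort
    # the combined runs and stream the merged output into reduce.
    runs = [combine(batch) for batch in batches]
    merged = heapq.merge(*runs, key=itemgetter(0))
    return [reduce_fn(k, [v for _, v in grp])
            for k, grp in groupby(merged, key=itemgetter(0))]

# Two hypothetical fetch batches of (word, count) pairs from the map side.
batches = [[("a", 1), ("b", 1), ("a", 1)],
           [("b", 1), ("a", 1)]]
flat = [pair for batch in batches for pair in batch]

assert reducer_notion_1(flat) == reducer_notion_2(batches)
print(reducer_notion_2(batches))  # [('a', 3), ('b', 2)]
```

Note that the two versions only agree because the combiner logic here is associative and commutative, which is exactly the contract Hadoop places on combiners: a combiner may be applied zero, one, or many times without changing the final result.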

Can you help me understand which notion is correct? I have never read anywhere that the combiner runs on the reduce side as well; however, I am no longer sure of that after reading the blog I hyperlinked earlier.
