Summary from the book "hadoop definitive guide - tom white" is:
All the logic between user's map function and user's reduce function is called shuffle. Shuffle then spans across both map and reduce. After user's map() function, the output is in in-memory circular buffer. When the buffer is 80% full, the background thread starts to run. The background thread will output the buffer's content into a spill file. This spill file is partitioned by key. And within each partition, the key-value pairs are sorted by key.After sorting, if combiner function is enabled, then combiner function is called. All spill files will be merged into one MapOutputFile. And all Map tasks's MapOutputFile will be collected over network to Reduce task. Reduce task will do another sort. And then user's Reduce function will be called.
So the questions are:
1.) According to the above summary, this is the flow:
Mapper--Partioner--Sort--Combiner--Shuffle--Sort--Reducer--Output
1a.) Is this the flow or is it something else?
1b.) Can u explain the above flow with an example say word count example, (the ones I found online weren't that elaborative) ?
2.) So the mappers phase output is one big file (MapOutputFile)? And it is this one big file that is broken into and the key-value pairs are passed onto the respective reducers?
3.) Why does the sorting happens for a second time, when the data is already sorted & combined when passed onto their respective reducers?
4.) Say if mapper1 is run on Datanode1 then is it necessary for reducer1 to run on the datanode1? Or it can run on any Datanode?