The process by which the system sorts the map output on the map side is known as the sort. Is this part of the shuffle? In other words, when does the shuffle start: after the map output has been written to disk, or after it has been written to the in-memory buffer?
1 Answer
The whole MapReduce process is explained in detail here: http://ercoppa.github.io/HadoopInternals/AnatomyMapReduceJob.html
To answer your question, a single map task consists of the following steps:
- INIT phase: we set up the map task
- EXECUTION phase: for each (key, value) tuple inside the map split, we run the map() function
- SPILLING phase: the map output is stored in an in-memory buffer; when this buffer is almost full, we start (in parallel) the spilling phase to remove data from it
- SHUFFLE phase: at the end of the spilling phase, we merge all the map outputs and package them for the reduce phase
The Execution and Spilling phases run in parallel. So, data is written to a circular in-memory buffer -> when the buffer is 80% full -> sorted in memory -> written to local disk.
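To make the buffer-then-spill flow concrete, here is a minimal Python sketch. It is a toy model, not Hadoop's actual code: the `SpillBuffer` class, its record-count capacity, and the byte-sum partitioner are all assumptions made for illustration (the real buffer is sized in bytes and spills in a background thread). Only the 80% threshold comes from the description above.

```python
from typing import List, Tuple

Record = Tuple[int, str, int]  # (partition, key, value)


class SpillBuffer:
    """Toy stand-in for the map-side buffer (hypothetical, not a Hadoop class)."""

    def __init__(self, capacity: int, spill_percent: float = 0.80,
                 num_partitions: int = 2) -> None:
        # Capacity is counted in records here; this is a simplification.
        self.threshold = int(capacity * spill_percent)
        self.num_partitions = num_partitions
        self.records: List[Record] = []
        self.spills: List[List[Record]] = []  # each entry models one spill file

    def collect(self, key: str, value: int) -> None:
        """Called for every (key, value) pair emitted by map()."""
        # Deterministic toy partitioner (an assumption, not Hadoop's hash).
        partition = sum(key.encode()) % self.num_partitions
        self.records.append((partition, key, value))
        if len(self.records) >= self.threshold:
            self._sort_and_spill()

    def _sort_and_spill(self) -> None:
        """Sort buffered records by (partition, key) and 'write' one spill."""
        if not self.records:
            return
        self.records.sort(key=lambda r: (r[0], r[1]))
        self.spills.append(self.records)
        self.records = []

    def flush(self) -> None:
        """Final spill at the end of the EXECUTION phase."""
        self._sort_and_spill()


buf = SpillBuffer(capacity=5)  # threshold is 4 records
for k in ["d", "c", "b", "a", "b", "a"]:
    buf.collect(k, 1)
buf.flush()
print(len(buf.spills))  # number of spill "files" produced
```

Each spill is sorted by partition first and key second, which is why a single spill can hold several partition segments.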
At the end of the EXECUTION phase, the SPILLING thread is triggered for the last time. In more detail, we:
- sort and spill the remaining unspilled tuples
- start the SHUFFLE phase
Notice that each time the buffer was almost full, we get one spill file (SpillRecord + output file). Each spill file contains several partitions (segments).
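The final merge of the spill files can be sketched the same way. This is only an illustration under assumptions: `merge_spills` is a hypothetical helper, spill files are modelled as in-memory lists of (partition, key, value) tuples already sorted by (partition, key), and the merge is a plain k-way merge rather than Hadoop's own merger.

```python
import heapq
from itertools import groupby
from typing import Dict, List, Tuple

Record = Tuple[int, str, int]  # (partition, key, value)


def merge_spills(spills: List[List[Record]]) -> Dict[int, List[Tuple[str, int]]]:
    """K-way merge of sorted spills into one sorted segment per partition."""
    # heapq.merge assumes every input is already sorted by the same key.
    merged = heapq.merge(*spills, key=lambda r: (r[0], r[1]))
    # Consecutive records with the same partition form one segment.
    return {
        partition: [(key, value) for _, key, value in group]
        for partition, group in groupby(merged, key=lambda r: r[0])
    }


spills = [
    [(0, "b", 1), (0, "d", 1), (1, "a", 1), (1, "c", 1)],  # first spill
    [(0, "b", 1), (1, "a", 1)],                            # final spill
]
segments = merge_spills(spills)
print(segments[0])  # the segment a reducer for partition 0 would fetch
```

In this model, each reducer only needs the segment for its own partition, which is what the merged map output is packaged for.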

Gyanendra Dwivedi