Is sort part of shuffle in MapReduce?

Question

The process by which the system sorts the map output on the map side is known as the sort. Is this part of the shuffle? In other words, when does the shuffle start: after the map output has been written to disk, or after it has been written to the in-memory buffer?

score 0 · Accepted Answer · answered Feb 07 '18 at 20:15

The whole MapReduce process is explained in detail here: http://ercoppa.github.io/HadoopInternals/AnatomyMapReduceJob.html

To answer your question, a single map task consists of the following steps:

INIT phase: we set up the Map Task
EXECUTION phase: for each (key, value) tuple inside the map split we run the map() function
SPILLING phase: the map output is stored in an in-memory buffer; when this buffer is almost full, we start (in parallel) the spilling phase in order to remove data from it
SHUFFLE phase: at the end of the spilling phase, we merge all the map outputs and package them for the reduce phase

The EXECUTION and SPILLING phases run in parallel: data is written to a circular in-memory buffer, and when the buffer is 80% full (the default threshold), its contents are sorted in memory and written to local disk.

At the end of the EXECUTION phase, the SPILLING thread is triggered for the last time. In more detail, we:

sort and spill the remaining unspilled tuples
start the SHUFFLE phase
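
The buffer-and-spill behaviour described above can be sketched as a toy simulation. This is only an illustration, not Hadoop's actual code: the function name `spill_map_output`, the buffer size of 10 records, and the sequential loop are assumptions made for the example (in Hadoop the spill runs in a separate thread while map() keeps writing, and the threshold is set by `mapreduce.map.sort.spill.percent`).

```python
def spill_map_output(records, buffer_capacity=10, spill_percent=0.8):
    """Toy model of a map task's output buffer (hypothetical helper).

    Records are appended to an in-memory buffer. When the buffer reaches
    the spill threshold, its contents are sorted and "spilled" (here:
    appended to a list that stands in for spill files on local disk).
    """
    threshold = int(buffer_capacity * spill_percent)
    buffer, spills = [], []
    for record in records:
        buffer.append(record)
        if len(buffer) >= threshold:
            spills.append(sorted(buffer))  # sort happens at spill time
            buffer.clear()
    # End of the EXECUTION phase: sort and spill whatever is left.
    if buffer:
        spills.append(sorted(buffer))
    return spills


# Word-count style map output: one (word, 1) pair per word.
text = "the quick brown fox jumps over the lazy dog again and again"
records = [(word, 1) for word in text.split()]
spills = spill_map_output(records)
for i, spill in enumerate(spills):
    print(f"spill {i}: {[key for key, _ in spill]}")
```

With 12 input records and a threshold of 8, this produces two spills: one triggered by the threshold and one by the final flush. Each spill is sorted on its own, which is why a merge step is still needed afterwards.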

Notice that each time the buffer was almost full, we get one spill file (SpillRecord + output file). Each spill file contains several partitions (segments).
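
To make the "several partitions per spill file" point concrete, here is a rough sketch of how spills could be split into per-reducer segments and then merged at the end of the map task. The names `partition_of`, `make_spill`, and `merge_spills` are hypothetical, and the CRC32-based partitioner is only a stand-in for whatever partitioner the job uses; the real on-disk format is not modelled.

```python
import heapq
import zlib


def partition_of(key, num_reducers):
    # Hypothetical stand-in for a hash partitioner: key -> reducer index.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def make_spill(records, num_reducers):
    """Build one spill: a sorted segment of (key, value) pairs per partition."""
    segments = {p: [] for p in range(num_reducers)}
    for key, value in records:
        segments[partition_of(key, num_reducers)].append((key, value))
    for segment in segments.values():
        segment.sort(key=lambda kv: kv[0])
    return segments


def merge_spills(spills, num_reducers):
    """Merge the matching segments of every spill into one sorted run per partition."""
    return {
        p: list(heapq.merge(*(spill[p] for spill in spills), key=lambda kv: kv[0]))
        for p in range(num_reducers)
    }


# Two spills from the same map task, two reducers.
first = make_spill([("the", 1), ("quick", 1), ("brown", 1), ("fox", 1)], 2)
second = make_spill([("the", 1), ("lazy", 1), ("dog", 1)], 2)
merged = merge_spills([first, second], 2)
for p, segment in merged.items():
    print(f"partition {p}: {[key for key, _ in segment]}")
```

Because every segment is already sorted, the merge only has to interleave them, and each reducer later fetches just the segment for its own partition.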
