
I have a question about the transfer protocols the Hadoop framework uses to copy mapper output (which is stored locally on the mapper node) to reducer tasks (which do not run on the same node):

- I have read in some blogs that HTTP is used for the shuffle phase.
- I have also read that HDFS data transfers (used by MapReduce jobs) are done directly over TCP/IP sockets.
- I have read about RPC in Hadoop: The Definitive Guide.

Any pointers/references would be of great help.

SurjanSRawat

1 Answer


Hadoop uses HTTP servlets for intermediate data shuffling. See the figure below (taken from "JVM-Bypass for Efficient Hadoop Shuffling" by Wang et al.):

[Figure: Intermediate data shuffling in Hadoop]

For a more careful treatment, have a look at the "JVM-Bypass for Efficient Hadoop Shuffling" paper published in 2013 (full text available).
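To make the HTTP-based shuffle concrete: in classic (Hadoop 1.x) MapReduce, each reduce task fetches its partition of a map task's output by issuing an HTTP GET to a servlet on the mapper's TaskTracker. The sketch below only builds such a fetch URL; the exact parameter names mirror the 1.x MapOutputServlet style, and all hosts, ports, and IDs are made-up placeholders for illustration, not values from a real cluster.

```java
// Sketch: how a reduce task might address a mapper's output over HTTP.
// Assumptions: URL shape follows Hadoop 1.x's MapOutputServlet
// (http://<tasktracker>:<port>/mapOutput?job=...&map=...&reduce=...);
// the host, port, and IDs below are hypothetical examples.
public class ShuffleUrl {

    // Build the HTTP URL a reducer would GET to pull one map output partition.
    static String mapOutputUrl(String host, int port, String jobId,
                               String mapAttemptId, int reducePartition) {
        return String.format("http://%s:%d/mapOutput?job=%s&map=%s&reduce=%d",
                             host, port, jobId, mapAttemptId, reducePartition);
    }

    public static void main(String[] args) {
        // Hypothetical TaskTracker host/port and job/attempt IDs.
        String url = mapOutputUrl("tracker01", 50060,
                                  "job_201301010000_0001",
                                  "attempt_201301010000_0001_m_000003_0",
                                  2);
        System.out.println(url);
    }
}
```

The key point the URL illustrates: the shuffle is a pull model over plain HTTP, with the reduce partition number passed as a query parameter, which is why the paper above can profile (and bypass) this servlet path specifically.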

Denis