1

In a Hadoop job, which node performs the sorting/shuffling phase? Does increasing the memory of that node improve sorting/shuffling performance?

HHH

2 Answers

2

In my experience, the relevant parameters to tune in mapred-site.xml are:

  • io.sort.mb: the size of a mapper's output buffer. When this buffer fills up, the data is sorted and spilled to disk. Ideally you avoid having too many spills. Note that this memory is part of the map task's heap.
  • mapred.map.child.java.opts: the JVM options of a map task, including its heap size (-Xmx). The larger the heap, the larger you can make the output buffer.
  • In principle the number of reduce tasks also influences the shuffle speed. The number of reduce rounds is the number of reduce tasks divided by the total number of reduce slots, rounded up. Note that the initial shuffle (during the map phase) only moves data to the reducers that are already active. So mapred.reduce.tasks is also relevant.
  • io.sort.factor is the maximum number of streams (for example spill files) merged at once during the merge sort, both on the map and the reduce side.
  • Compression also has a large impact: it speeds up the transfer from mapper to reducer, but compressing and decompressing comes at a CPU cost.
  • mapred.job.shuffle.input.buffer.percent is the percentage of the reducer's heap used to hold map output in memory during the shuffle.

There are undoubtedly more tuning opportunities, but these are the ones I have spent quite some time experimenting with.
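To get a feel for how two of these settings interact, here is a rough back-of-the-envelope sketch in plain Java (no Hadoop dependency). The class and method names are made up for illustration, and all input numbers are hypothetical. It assumes a spill is triggered when the buffer reaches the io.sort.spill.percent threshold (0.80 is used as an assumed value) and it ignores the part of the buffer reserved for record metadata, so treat the results as estimates only.

```java
// Rough estimates only; class name, method names, and inputs are hypothetical.
class ShuffleEstimates {

    // Estimated number of spills for one map task: map output size divided by
    // the usable part of the sort buffer (io.sort.mb * spill threshold), rounded up.
    // Simplification: ignores buffer space used for record metadata.
    static long estimatedSpills(long mapOutputMb, int ioSortMb, double spillPercent) {
        double usableMb = ioSortMb * spillPercent;
        return (long) Math.ceil(mapOutputMb / usableMb);
    }

    // Reduce rounds: reduce tasks divided by available reduce slots, rounded up.
    static int reduceRounds(int reduceTasks, int reduceSlots) {
        return (reduceTasks + reduceSlots - 1) / reduceSlots;
    }

    public static void main(String[] args) {
        // Hypothetical map task emitting 450 MB of output.
        System.out.println("spills with io.sort.mb=100: " + estimatedSpills(450, 100, 0.80));
        System.out.println("spills with io.sort.mb=200: " + estimatedSpills(450, 200, 0.80));
        // Hypothetical cluster with 10 reduce slots.
        System.out.println("rounds for 25 reduce tasks: " + reduceRounds(25, 10));
    }
}
```

With these made-up numbers, doubling io.sort.mb halves the estimated spill count, and 25 reduce tasks on 10 slots need three rounds. Real jobs should be checked against the spilled-records counters rather than this arithmetic.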

DDW
  • Thanks. How can I change ``mapred.map.child.java.opts`` within the code? I do not have access to the config files of the Hadoop cluster. – HHH Oct 30 '13 at 19:54
  • Configuration conf = new Configuration(); conf.set("mapred.child.java.opts", "-Xmx1024m"); Job job = new Job(conf); (replace -Xmx1024m with the desired heap size) – Thejas Oct 31 '13 at 06:54
1

The sort and shuffle phase is split between the mappers and the reducers. That is why the reduce percentage starts increasing (usually up to 33%) while the mappers are still running.

Whether increasing the sort buffer memory improves performance depends on:

a) The size and total number of keys emitted by the mapper

b) The nature of the mapper tasks (I/O intensive or CPU intensive)

c) The available main memory and the occupied map/reduce slots on the given node

d) Data skew
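To illustrate point d), here is a small plain-Java sketch of how a skewed key distribution loads one reducer far more than the others. It mirrors the formula of Hadoop's default hash partitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks, but uses String.hashCode() as a stand-in for the real key type. The class name, keys, and counts are hypothetical.

```java
import java.util.Arrays;

// Illustration only; class name, keys, and record counts are hypothetical.
class SkewDemo {

    // Same formula as the default hash partitioner, applied to a String key
    // (a stand-in for whatever key type the job actually uses).
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Sums the record counts of all keys that land on each reducer.
    static long[] recordsPerReducer(String[] keys, long[] counts, int numReduceTasks) {
        long[] load = new long[numReduceTasks];
        for (int i = 0; i < keys.length; i++) {
            load[partition(keys[i], numReduceTasks)] += counts[i];
        }
        return load;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "c"};
        long[] counts = {1_000_000L, 10L, 10L}; // one "hot" key dominates
        long[] load = recordsPerReducer(keys, counts, 4);
        System.out.println(Arrays.toString(load));
    }
}
```

In a case like this, a larger sort buffer does little for overall job time, because the job still waits for the single reducer that receives the hot key.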

You can find more information at https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-6/shuffle-and-sort

Thejas