
I have a question about how the combiner works in the Hadoop MapReduce framework. Is the combiner applied only to the key-value pairs output by a single map task, or to the output of all the map tasks running on a given node? I have done some tests and it seems to be the former. If I'm right, why was this behavior chosen, given that combining the outputs of all map tasks on a node could significantly reduce bandwidth use?

Thanks in advance.

SlimShady

1 Answer

  • How would a node know when all of its map tasks are complete? The TaskTracker doesn't know how the JobTracker will assign map tasks, so it would have to wait for every map task to finish before running the combiners.
  • You still want to keep the data flowing between mappers and reducers. As combiners run and output is produced, that data starts getting shuffled to the reducers right away (unless the slowstart configuration is set to something high). This is good because it spreads the network utilization out over time.
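
To make the "per map task" behavior concrete, here is a minimal word-count-style driver, a sketch assuming Hadoop 2.x (the slowstart property was named mapred.reduce.slowstart.completed.maps in Hadoop 1.x); the paths and the use of the library TokenCounterMapper/IntSumReducer classes are just illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    public class CombinerDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Reducers may begin fetching completed map output before all maps finish;
            // raising this fraction delays the shuffle (the "slowstart" mentioned above).
            conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.05f);

            Job job = Job.getInstance(conf, "combiner demo");
            job.setJarByClass(CombinerDemo.class);
            job.setMapperClass(TokenCounterMapper.class);
            // The combiner runs on each individual map task's output as it spills and
            // merges, not on the pooled output of every map task on the node.
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
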
Donald Miner
  • Thank you very much for your answer, I think I get it. I'm really lucky to have you answering my question. In fact, I was trying to apply the design pattern you explained in your book ^^ (a really useful book, by the way): the average example in the summarization patterns. – SlimShady Oct 24 '13 at 19:42
  • My Hadoop job's goal is to compute an average over all the values that share a given key, for many different keys. I'm facing a situation where each map returns several key-value pairs but with different keys, so it isn't possible to do a 'local average' on the map outputs by applying a combiner. I think the solution to my problem is to chain two jobs: one to compute averages over several parts of the data (the number of key-value pairs for a given key is too large for one reducer) and a second job to compute the global average. What do you think? – SlimShady Oct 24 '13 at 19:42
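
For the averaging problem in this comment, one common alternative to chaining two jobs is the sum-and-count trick, in the spirit of the averaging example the comment refers to: the mapper emits a (sum, count) pair instead of a raw value, and since sums and counts are associative the combiner can pre-aggregate them safely while the reducer computes the final average. The sketch below is only illustrative; it assumes "key,value" text input lines and encodes the pair as tab-separated text, and all class names are hypothetical.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class AverageJob {

        // Input lines are assumed to look like "key,value"; the mapper emits the
        // partial aggregate "value \t 1" so sums and counts can be combined safely.
        public static class AvgMapper extends Mapper<LongWritable, Text, Text, Text> {
            private final Text outKey = new Text();
            private final Text outVal = new Text();
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] parts = line.toString().split(",");
                if (parts.length != 2) return;           // skip malformed lines
                outKey.set(parts[0].trim());
                outVal.set(parts[1].trim() + "\t1");     // (sum, count) = (value, 1)
                ctx.write(outKey, outVal);
            }
        }

        // The combiner adds partial sums and counts; because (sum, count) pairs are
        // associative, this is correct even though averages themselves are not.
        public static class SumCountCombiner extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context ctx)
                    throws IOException, InterruptedException {
                double sum = 0;
                long count = 0;
                for (Text v : values) {
                    String[] parts = v.toString().split("\t");
                    sum += Double.parseDouble(parts[0]);
                    count += Long.parseLong(parts[1]);
                }
                ctx.write(key, new Text(sum + "\t" + count));
            }
        }

        // The reducer sees far fewer records per key and emits the final average.
        public static class AvgReducer extends Reducer<Text, Text, Text, DoubleWritable> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context ctx)
                    throws IOException, InterruptedException {
                double sum = 0;
                long count = 0;
                for (Text v : values) {
                    String[] parts = v.toString().split("\t");
                    sum += Double.parseDouble(parts[0]);
                    count += Long.parseLong(parts[1]);
                }
                ctx.write(key, new DoubleWritable(sum / count));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "average with combiner");
            job.setJarByClass(AverageJob.class);
            job.setMapperClass(AvgMapper.class);
            job.setCombinerClass(SumCountCombiner.class);
            job.setReducerClass(AvgReducer.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With this layout the combiner can legitimately shrink the per-key traffic, and a single job is usually enough; a first job that produces partial aggregates, as proposed in the comment, remains a reasonable fallback if one key still overwhelms a single reducer.
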