
For load-balancing reasons, I want to create more partitions than reducers in a Hadoop environment. Is there a way to assign partitions to specific reducers, and if so, where can I define this assignment? I wrote a custom Partitioner and now want to address specific reducers with specific partitions.

Thank you in advance for the help!

beto8888

2 Answers


The partitioning is done for the reducers: as many partitions are created as there are reducers. You can choose the number of reducers by

job.setNumReduceTasks(n);

The number n need not be limited by the number of physical reducer slots you have; the extra reduce tasks will simply wait for the next free slot. In your partitioner code, you can implement the logic required to assign a key to a specific partition.
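For illustration, here is a minimal sketch of such a partitioner using the new mapreduce API; the Text/IntWritable types and the routing rule are assumptions for the example, not taken from the question:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of a custom partitioner: routes each key to a partition
// with your own logic instead of the default hash.
public class CustomPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        if (numPartitions == 1) {
            return 0;
        }
        // Hypothetical rule: keys starting with a digit go to partition 0,
        // everything else is hashed over the remaining partitions.
        if (!k.isEmpty() && Character.isDigit(k.charAt(0))) {
            return 0;
        }
        return 1 + (k.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}

You would then register it in the driver with job.setPartitionerClass(CustomPartitioner.class); alongside job.setNumReduceTasks(n);.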

However, I do not see any efficiency gain in going beyond the number of physically available reducer slots, as it will only result in waiting for the next reduce slot.

Rags
  • Thank you for your help. The goal of creating more partitions than reducers is to calculate the size of the individual partitions and then give a reducer more than one partition, so that all reducers get the same amount of work – beto8888 Apr 26 '13 at 09:30

Hadoop doesn't lend itself to this kind of control.

As explained on pp. 43-44 of this excellent book, the programmer has little control over:

  1. Where a mapper or reducer runs (i.e., on which node in the cluster).
  2. When a mapper or reducer begins or finishes.
  3. Which input key-value pairs are processed by a specific mapper.
  4. Which intermediate key-value pairs are processed by a specific reducer (this is what you would like).

BUT

You can influence number 4 by implementing a cleverly designed custom Partitioner that splits your data just the way you want and distributes your load across reducers as expected. Check out how they implement a custom partitioner to calculate relative frequencies in chapter 3.3.
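Since Hadoop creates exactly one partition per reducer, one way to emulate "more partitions than reducers" is to compute a fine-grained logical partition first and then map several logical partitions onto each reducer. A sketch of that idea follows; the partition counts and the assignment table are hard-coded assumptions (in a real job the table would come from your size calculation):

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch: emulate "more partitions than reducers" by computing a
// fine-grained logical partition, then mapping several logical
// partitions onto each reducer via a precomputed lookup table.
public class BalancedPartitioner extends Partitioner<Text, IntWritable> {
    private static final int LOGICAL_PARTITIONS = 32;  // assumption
    private static final Map<Integer, Integer> ASSIGNMENT =
        new HashMap<Integer, Integer>();
    static {
        // In a real job this table would come from your size estimates,
        // e.g. loaded from the distributed cache; here it is hard-coded.
        for (int p = 0; p < LOGICAL_PARTITIONS; p++) {
            ASSIGNMENT.put(p, p % 8);  // assumption: 8 reducers
        }
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Fine-grained logical partition first, then look up which
        // reducer that logical partition was assigned to.
        int logical = (key.hashCode() & Integer.MAX_VALUE) % LOGICAL_PARTITIONS;
        return ASSIGNMENT.get(logical) % numPartitions;  // stay in range
    }
}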

Engineiro
  • Thank you very much for your answer. Is it therefore correct that I won't be able to analyze the data during the map function, calculate the distribution of the data after all mappers are done, and then distribute it with a custom partitioner that is built only after all map functions have finished, according to the specific distribution of my input data? – beto8888 Apr 26 '13 at 11:53
  • Unfortunately, Hadoop doesn't allow that kind of control. There may be something in the works in the next Hadoop YARN and MR2 since it is a major overhaul but I am not aware of this today. If I've answered the question to your satisfaction, please accept my answer. – Engineiro Apr 26 '13 at 11:58
  • user2323063, actually you can sample your data by running maps on portions of the data, then place the calculated splits into the distributed cache. You can see how this is done in the TeraSort implementation (a sketch of this approach follows after these comments): http://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html – octo Apr 27 '13 at 05:42
  • They implement a custom partitioner to split up the data to ensure reducers down the line output more data, and the partitioner knows the key space. A version of what the OP wants can be done with a clever partitioner, but it can't be done if the requirements are strictly: 1) analyze the data, and when all mappers are done, 2) calculate the distribution of the data, 3) distribute it with a custom partitioner – Engineiro Apr 27 '13 at 13:05
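Following octo's pointer to TeraSort, here is a sketch of how sampling-based split points can be wired up with Hadoop's stock InputSampler and TotalOrderPartitioner. The input format, reducer count, paths, and sampling parameters are assumptions for the example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

// Driver fragment: sample the input (as TeraSort does), write the
// computed split points to a partition file, and let
// TotalOrderPartitioner pick them up at task start.
public class SampledPartitionDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sampled partitioning");
        job.setJarByClass(SampledPartitionDriver.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setNumReduceTasks(8);                              // assumption
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Sample roughly 1% of the keys, up to 10,000 samples taken
        // from at most 10 input splits.
        InputSampler.Sampler<Text, Text> sampler =
            new InputSampler.RandomSampler<Text, Text>(0.01, 10000, 10);
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
            new Path("/tmp/partitions.lst"));                  // assumption
        InputSampler.writePartitionFile(job, sampler);

        job.setPartitionerClass(TotalOrderPartitioner.class);
        // ... set mapper, reducer, output types and path,
        // then job.waitForCompletion(true);
    }
}

TotalOrderPartitioner reads the split points from the partition file when each task starts, which is essentially the distributed-cache mechanism octo describes.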