
According to http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/, the formula for determining the number of concurrently running tasks per node is:

min(yarn.nodemanager.resource.memory-mb / mapreduce.[map|reduce].memory.mb,
    yarn.nodemanager.resource.cpu-vcores / mapreduce.[map|reduce].cpu.vcores)

However, on setting these parameters as follows (for a cluster of c3.2xlarge instances):

    yarn.nodemanager.resource.memory-mb = 14336
    mapreduce.map.memory.mb = 2048
    yarn.nodemanager.resource.cpu-vcores = 8
    mapreduce.map.cpu.vcores = 1

I find I'm only getting up to 4 tasks running concurrently per node, when the formula says there should be 7. What's the deal?
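
For concreteness, here is that computation spelled out (a quick sketch in plain Python; the variable names are just shorthand for the properties above):

    # Cloudera's per-node concurrency formula, evaluated with the
    # c3.2xlarge settings above. Integer division models whole containers.
    node_mem_mb, map_mem_mb = 14336, 2048
    node_vcores, map_vcores = 8, 1

    concurrent_maps = min(node_mem_mb // map_mem_mb,   # 14336 / 2048 = 7
                          node_vcores // map_vcores)   # 8 / 1 = 8
    print(concurrent_maps)  # 7 -- yet only 4 run in practice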

I'm running Hadoop 2.4.0 on AMI 3.1.0.

verve
  • Can you try with the CapacityScheduler (http://hadoop.apache.org/docs/r2.4.0/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html)? Use yarn.scheduler.capacity.maximum-am-resource-percent / yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent – Sandesh Deshmane Aug 08 '14 at 05:59
  • CapacityScheduler is for distributing cluster resources among several YARN-based applications while ensuring some minimum capacity for each -- think PBS for YARN. I'm looking for the Hadoop 2.x analogs to Hadoop 1.x's mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum; and before someone says "mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum," these do not work in MapReduce 2 because it does away with the TaskTracker and the concept of slots -- read the first gotcha from the Cloudera blog post. – verve Aug 08 '14 at 11:36
  • My issue is that Cloudera's formula may work for CDH but doesn't appear to for Hadoop 2.4.0 on EMR. – verve Aug 08 '14 at 11:37
  • 1
    if you check http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/HadoopMemoryDefault_H2.html. it will show default configs for c3.2xlarge. May be this can help to find out if there is memory left to run more processes ( maps) . – Sandesh Deshmane Aug 08 '14 at 16:26
  • 1
    Thanks for the suggestion Sandesh; that is a useful link. I have found empirically that the formula is more like: min (2 / 3 * yarn.nodemanager.resource.memory-mb / mapreduce.[map|reduce].memory.mb, yarn.nodemanager.resource.cpu-vcores / mapreduce.[map|reduce].cpu.vcores) on EMR. I wonder if the pmem-to-vmem ratio participates somehow. I guess I could dive into the source to see, but it would be nice to hear from someone with Hadoop 2.x expertise. – verve Aug 08 '14 at 18:23

1 Answer


My empirical formula was incorrect. The formula provided by Cloudera is the correct one and appears to give the expected number of concurrently running tasks, at least on AMI 3.3.1.

verve
  • I'm not seeing a formula on that page different from what you have listed above. Could you please include the formula that yields the 4 tasks you are seeing? Furthermore, do you know if vcores can be set fractional for IO-bound tasks? – AaronM May 09 '15 at 00:00
  • Check the comments on the question; the empirical formula I'm referring to is the one from my comment of Aug 08 '14. That is the incorrect formula; the correct formula is the one from Cloudera. – verve Nov 09 '15 at 16:15