
According to http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/, the formula for determining the number of concurrently running tasks per node is:

min(yarn.nodemanager.resource.memory-mb / mapreduce.[map|reduce].memory.mb,
    yarn.nodemanager.resource.cpu-vcores / mapreduce.[map|reduce].cpu.vcores)

However, on setting these parameters as follows (for a cluster of c3.2xlarge instances):

    yarn.nodemanager.resource.memory-mb = 14336
    mapreduce.map.memory.mb = 2048
    yarn.nodemanager.resource.cpu-vcores = 8
    mapreduce.map.cpu.vcores = 1

I find I'm only getting up to 4 tasks running concurrently per node, when the formula says there should be 7. What's the deal?
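
For concreteness, here is that computation spelled out (a quick sketch in plain Python; the variable names are just shorthand for the properties above):

    # Cloudera's per-node concurrency formula, evaluated with the
    # c3.2xlarge settings above. Integer division models whole containers.
    node_mem_mb, map_mem_mb = 14336, 2048
    node_vcores, map_vcores = 8, 1

    concurrent_maps = min(node_mem_mb // map_mem_mb,   # 14336 / 2048 = 7
                          node_vcores // map_vcores)   # 8 / 1 = 8
    print(concurrent_maps)  # 7 -- yet only 4 run in practice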

I'm running Hadoop 2.4.0 on AMI 3.1.0.

verve
  • Can you try with the CapacityScheduler (http://hadoop.apache.org/docs/r2.4.0/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html)? Use yarn.scheduler.capacity.maximum-am-resource-percent / yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent – Sandesh Deshmane Aug 08 '14 at 05:59
  • CapacityScheduler is for distributing cluster resources among several YARN-based applications while ensuring some minimum capacity for each -- think PBS for YARN. I'm looking for the Hadoop 2.x analogs to Hadoop 1.x's mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum; and before someone says "mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum," these do not work in MapReduce 2 because it does away with the TaskTracker and the concept of slots -- read the first gotcha from the Cloudera blog post. – verve Aug 08 '14 at 11:36
  • My issue is that Cloudera's formula may work for CDH but doesn't appear to for Hadoop 2.4.0 on EMR. – verve Aug 08 '14 at 11:37
  • 1
    if you check http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/HadoopMemoryDefault_H2.html. it will show default configs for c3.2xlarge. May be this can help to find out if there is memory left to run more processes ( maps) . – Sandesh Deshmane Aug 08 '14 at 16:26
  • 1
    Thanks for the suggestion Sandesh; that is a useful link. I have found empirically that the formula is more like: min (2 / 3 * yarn.nodemanager.resource.memory-mb / mapreduce.[map|reduce].memory.mb, yarn.nodemanager.resource.cpu-vcores / mapreduce.[map|reduce].cpu.vcores) on EMR. I wonder if the pmem-to-vmem ratio participates somehow. I guess I could dive into the source to see, but it would be nice to hear from someone with Hadoop 2.x expertise. – verve Aug 08 '14 at 18:23

1 Answer


My empirical formula was incorrect. The formula provided by Cloudera is the correct one and appears to give the expected number of concurrently running tasks, at least on AMI 3.3.1.

verve
  • I'm not seeing a formula on that page different from what you have listed above. Could you please include the formula that yields the 4 tasks you are seeing? Furthermore, do you know if vcores can be set fractional for IO-bound tasks? – AaronM May 09 '15 at 00:00
  • Check the comments on the question; the empirical formula I'm referring to is the one from my comment of Aug 08 '14. That is the incorrect formula; the correct formula is the one from Cloudera. – verve Nov 09 '15 at 16:15