In a related question (How to set the precise max number of concurrently running tasks per node in Hadoop 2.4.0 on Elastic MapReduce), I ask for formulas relating the number of concurrently running mappers/reducers to the YARN and MR2 memory parameters. It turns out that on Elastic MapReduce, when my cluster has between 2 and 10 c3.2xlarge nodes, variations of the formulas mentioned there work okay, giving me 7-9 concurrently running mappers per node; but when the number of c3.2xlarge nodes is 20 or 40, I get cluster underutilization: only 1-4 mappers run per node. Since my job is CPU-bound, this is particularly awful: MR2 delivers _half_ the performance of MR1 for me.
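
For reference, this is the kind of capacity rule I've been working from — a minimal sketch of the memory-bound containers-per-node formula; the c3.2xlarge numbers are what I believe the EMR defaults to be, not verified values:

```python
# Minimal sketch of the memory-bound capacity rule I'm assuming applies here.
# The c3.2xlarge defaults below (yarn.nodemanager.resource.memory-mb and
# mapreduce.map.memory.mb) are illustrative guesses, not verified EMR values.

def max_mappers_per_node(node_memory_mb: int, map_container_mb: int) -> int:
    """Upper bound on concurrent map containers per node, memory only."""
    return node_memory_mb // map_container_mb

if __name__ == "__main__":
    # Assumed example values for a single c3.2xlarge node on EMR.
    print(max_mappers_per_node(11520, 1440))  # roughly 8 concurrent mappers
```

A per-node bound like this doesn't depend on the cluster size at all, which is why the drop to 1-4 mappers per node at 20-40 nodes surprises me.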
Why is this happening?