
I am running a job on EMR with mrjob; I am using AMI version 2.4.7 and Hadoop version 1.0.3.

I want to specify the number of reducers for a job, because I want a higher degree of parallelism for the job that follows it. Reading the answers to the other questions on this site, I gathered that I should set these parameters, and so I did: mapred.reduce.tasks=576 and mapred.tasktracker.reduce.tasks.maximum=24
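For reference, this is roughly how I am passing the options; a sketch of an mrjob.conf, assuming the EMR runner (key names may vary between mrjob versions):

```yaml
runners:
  emr:
    jobconf:
      mapred.reduce.tasks: 576
      mapred.tasktracker.reduce.tasks.maximum: 24
```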

However, it seems like the second option is not picked up: both the EMR and the Hadoop web interfaces report that there are 576 reduce tasks to run, but the reduce capacity of the cluster stays at 72 slots (on r3.8xlarge instances).

I even see that the option is set in /var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_XXX/job.xml: <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>24</value></property>. Still, only the default number of reducers (9) is actually running at any one time.
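To put numbers on what I am hoping to gain: the arithmetic below is a sketch, where the node count of 8 is an assumption (72 total slots / 9 default slots per node).

```python
import math

# Assumed: 72 total reduce slots / 9 default slots per node = 8 core nodes.
nodes = 8
default_slots_per_node = 9
desired_slots_per_node = 24
reduce_tasks = 576

# Number of "waves" needed to run all reduce tasks at each capacity.
waves_now = math.ceil(reduce_tasks / (nodes * default_slots_per_node))
waves_wanted = math.ceil(reduce_tasks / (nodes * desired_slots_per_node))
print(waves_now, waves_wanted)  # 8 3
```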

Why is the option not picked up by EMR? Or is there a different way to force a higher number of reducers on an instance?

David Nemeskey

1 Answer


With Hadoop 1, the number of map and reduce slots per node is set at the daemon level, so changing the value requires a restart of the TaskTracker daemons.

On EMR, the default number of slots per instance type can be found at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/HadoopMemoryDefault_H1.0.3.html.

In order to change these default values you will need to use a bootstrap action, such as configure-hadoop, to set mapred.tasktracker.reduce.tasks.maximum on the cluster before the Hadoop daemons start. See http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html#PredefinedbootstrapActions_ConfigureHadoop for more details.

Example (will need to be modified to match whatever interface is being used to create the cluster):

s3://<region>.elasticmapreduce/bootstrap-actions/configure-hadoop -m mapred.tasktracker.reduce.tasks.maximum=24
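Since the question uses mrjob, the same bootstrap action could be wired in through mrjob's configuration; a sketch assuming mrjob's bootstrap_actions option for the EMR runner (check the option name against your mrjob version):

```yaml
runners:
  emr:
    bootstrap_actions:
      - s3://<region>.elasticmapreduce/bootstrap-actions/configure-hadoop -m mapred.tasktracker.reduce.tasks.maximum=24
```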

Please note that if you change the number of slots per node, be sure to also adjust mapred.child.java.opts so that the per-task heap limit remains reasonable for the amount of memory available on the instance.
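As a back-of-the-envelope sketch of that sizing (the 48 GiB task-memory budget and the 24+24 slot split are illustrative assumptions, not EMR defaults):

```python
# Assumed memory budget set aside for all task slots on one node.
task_memory_mb = 48 * 1024
map_slots, reduce_slots = 24, 24

# Divide the budget evenly across concurrent task slots.
per_slot_mb = task_memory_mb // (map_slots + reduce_slots)
child_opts = "-Xmx%dm" % per_slot_mb  # value for mapred.child.java.opts
print(child_opts)  # -Xmx1024m
```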

ChristopherB
  • Thanks @ChristopherB, this was exactly the answer I was looking for. Of course, after I knew what the problem was, I also found it in the [mrjob documentation](http://mrjob.readthedocs.org/en/latest/guides/emr-advanced.html): _Some Hadoop options, such as the maximum number of running map tasks per node, must be set at bootstrap time and will not work with --jobconf._ But do I understand correctly that this changed with Hadoop 2, and in it, this option works when specified in the _jobconf_? – David Nemeskey Mar 02 '15 at 12:30
  • 2
    Yes, this changed with Hadoop 2 and YARN. There is no max map/reduce slots per node as with Hadoop 1. The number of concurrent tasks per node is driven by how much memory per container is requested and available per node which can be changed as part of JobConf. In Hadoop 2 MapReduce you would adjust the mapreduce.reduce.memory.mb and mapreduce.reduce.java.opts properties. The default configs are at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/TaskConfiguration_H2.html. – ChristopherB Mar 02 '15 at 23:58
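The YARN behavior described in the comment above can be sketched with illustrative numbers (both figures below are assumptions, not actual EMR defaults):

```python
# Assumed memory YARN may hand out on one node
# (yarn.nodemanager.resource.memory-mb).
node_memory_mb = 116 * 1024

# Assumed per-reducer container request (mapreduce.reduce.memory.mb).
reduce_container_mb = 4096

# Concurrency falls out of the memory arithmetic, not a fixed slot count.
concurrent_reducers = node_memory_mb // reduce_container_mb
print(concurrent_reducers)  # 29

# mapreduce.reduce.java.opts heap is commonly a bit below the container size.
java_opts = "-Xmx%dm" % int(reduce_container_mb * 0.8)
```

Requesting smaller containers raises per-node concurrency; larger containers lower it.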