
I have tried the following combinations of bootstrap actions to increase the heap size of my job, but none of them seem to work:

--mapred-key-value mapred.child.java.opts=-Xmx1024m 
--mapred-key-value mapred.child.ulimit=unlimited

--mapred-key-value mapred.map.child.java.opts=-Xmx1024m 
--mapred-key-value mapred.map.child.ulimit=unlimited

-m mapred.map.child.java.opts=-Xmx1024m
-m mapred.map.child.ulimit=unlimited 

-m mapred.child.java.opts=-Xmx1024m 
-m mapred.child.ulimit=unlimited 

What is the right syntax?

Shrish Bajpai

2 Answers


You have two options to achieve this:

Custom JVM Settings

To apply custom settings, you might want to have a look at the Bootstrap Actions documentation for Amazon Elastic MapReduce (Amazon EMR), specifically the predefined action Configure Daemons:

This predefined bootstrap action lets you specify the heap size or other Java Virtual Machine (JVM) options for the Hadoop daemons. You can use this bootstrap action to configure Hadoop for large jobs that require more memory than Hadoop allocates by default. You can also use this bootstrap action to modify advanced JVM options, such as garbage collection behavior.

An example is provided as well, which sets the namenode heap size to 2048 MB and configures a JVM garbage-collection option for the namenode:

$ ./elastic-mapreduce --create --alive \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \
  --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19
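
The same pattern applies to the other Hadoop daemons; for instance, a sketch along the same lines (heap values are illustrative, not recommendations) that raises the jobtracker and tasktracker heaps would be:

$ ./elastic-mapreduce --create --alive \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \
  --args --jobtracker-heap-size=3072,--tasktracker-heap-size=1024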

Predefined JVM Settings

Alternatively, as per the FAQ How do I configure Hadoop settings for my job flow?, if your job flow tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation a predefined bootstrap action is available to configure your job flow on startup: Configure Memory-Intensive Workloads, which allows you to set cluster-wide Hadoop settings to values appropriate for job flows with memory-intensive workloads, for example:

$ ./elastic-mapreduce --create \
--bootstrap-action \
  s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive

The specific configuration settings applied by this predefined bootstrap action are listed in Hadoop Memory-Intensive Configuration Settings.
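
Since multiple bootstrap actions can be specified per job flow (up to 16, see the comments below), the predefined settings could in principle be combined with a single custom override on top; a sketch (untested, with an illustrative heap value, using the -m syntax covered in the other answer) might look like:

$ ./elastic-mapreduce --create \
  --bootstrap-action \
    s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-m,mapred.child.java.opts=-Xmx1024m"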

Good luck!

Steffen Opel
  • Thanks Steffen, based on the documentation I have tried the above arguments with the "configure-hadoop" bootstrap script, but it is not working. It would be great if you could give me the exact command for setting the "mapred.child.java.opts" heap size in the mapred-site.xml configuration file of Hadoop – Shrish Bajpai Apr 05 '12 at 08:15
  • Thanks Steffen, I am able to set the other settings listed in "Hadoop Memory-Intensive Configuration Settings" except for "mapred.child.java.opts"; that's why I am asking for the exact argument/command. – Shrish Bajpai Apr 05 '12 at 08:24
  • @ShrishBajpai: You may want to try the alternative and supposedly easier approach first (see my updated answer), before diving deeper (this might offer some insight regarding your question as well); if you really need custom settings, I may look into this some more later on, but will be out of office for a couple of hours now, sorry. – Steffen Opel Apr 05 '12 at 08:24
  • @ShrishBajpai: Just reread your first comment - the listed `--namenode-heap-size=2048` option is to be used specifically with the _configure-daemons_ bootstrap action and won't work with the _configure-hadoop_ bootstrap action (as you already found out ;) – Steffen Opel Apr 05 '12 at 08:35
  • Thanks Steffen, but increasing the namenode heap size alone won't help. To confirm this I reran the job on an m.xlarge instance with the following configure-hadoop bootstrap action: --jobtracker-heap-size=3072 --namenode-heap-size=1024 --tasktracker-heap-size=512 --datanode-heap-size=no tracker and it failed with a heap error again. To overcome this I need to be able to increase the mapred.child.java.opts heap size, which I have tried locally using a native Hadoop setup – Shrish Bajpai Apr 05 '12 at 10:11
  • @ShrishBajpai: You may want to read my previous comment again - `--namenode-heap-size=1024` **does not work with *configure-hadoop***, rather only with *configure-daemons*. To add some useful information from [Bootstrap Action Basics](http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/Bootstrap.html#BootstrapUses) eventually: _You can specify up to 16 bootstrap actions per job flow by providing multiple --bootstrap-action parameters from the CLI or API._ – Steffen Opel Apr 05 '12 at 11:35

Steffen's answer is good and works. On the other hand, if you just want something quick and dirty and only need to replace one or two variables, you can change them via the command line like the following:

elastic-mapreduce --create \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-m,mapred.child.java.opts=-Xmx999m"

I've seen other documentation, albeit older, that simply quotes the entire expression within a single pair of quotes, like the following:

--bootstrap-action "s3://elasticmapreduce/bootstrap-actions/configure-hadoop -m \
    mapred.child.java.opts=-Xmx999m"    ### I tried this style, it no longer works!

At any rate, this is not easy to find in the AWS EMR documentation. I suspect that mapred.child.java.opts is one of the most frequently overridden variables; I was also looking for an answer when I got a GC error ("java.lang.OutOfMemoryError: GC overhead limit exceeded") and stumbled on this page. The default of 200m is just too small (documentation on defaults).
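
If you want to confirm that the override actually took effect, one option (assuming the classic EMR AMI layout, where Hadoop's configuration lives under /home/hadoop/conf; adjust the path if your AMI differs) is to SSH into the master node and inspect mapred-site.xml:

$ grep -A 1 'mapred.child.java.opts' /home/hadoop/conf/mapred-site.xml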

Good luck!

Kei-ven