1

My question is: should I handle multiprocessing in my mapper myself (read tasks from stdin, distribute them over worker processes, combine the results in a master process, and write to stdout), or will Hadoop take care of it automatically?

I haven't found the answer in either the Hadoop Streaming documentation or the Amazon Elastic MapReduce FAQ.

lithuak

3 Answers

1

Hadoop has a notion of "slots". A slot is a place where a mapper process will run. You configure the number of slots per tasktracker node; that is the theoretical maximum number of map processes that will run in parallel per node. It can be less if there are not enough separate portions of the input data (called FileSplits).
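On a classic (MRv1) cluster those slot counts are normally set per tasktracker in mapred-site.xml; a minimal sketch, with purely illustrative values:

<!-- mapred-site.xml on each tasktracker node (example values) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>  <!-- map slots on this node -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>  <!-- reduce slots on this node -->
</property>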
Elastic MapReduce has its own estimate of how many slots to allocate, depending on the instance capabilities.
At the same time, I can imagine a scenario where your processing is more efficient when one data stream is processed by many cores. If your mapper has built-in multicore usage, you can reduce the number of slots. But this is usually not the case in typical Hadoop tasks.

David Gruzman
1

See the EMR doco [1] for the number of map/reduce tasks per instance type.

In addition to David's answer, you can also have Hadoop run multiple threads per map slot by setting...

conf.setMapRunnerClass(MultithreadedMapRunner.class);  

The default is 10 threads but it's tunable with

-D mapred.map.multithreadedrunner.threads=5

I often find this useful for custom high IO stuff.
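For reference, a minimal driver using the old org.apache.hadoop.mapred API might look like the sketch below. IdentityMapper and the argument paths are only placeholders; whatever mapper you plug in must be thread-safe, since its map() is called from several threads at once.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

public class MultithreadedMapJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MultithreadedMapJob.class);
        conf.setJobName("multithreaded-map-example");

        // Run each map task with a pool of threads instead of a single loop.
        conf.setMapRunnerClass(MultithreadedMapRunner.class);
        // Override the default of 10 threads per map task.
        conf.setInt("mapred.map.multithreadedrunner.threads", 5);

        // Placeholder mapper; with the default TextInputFormat it passes
        // (LongWritable offset, Text line) pairs straight through.
        conf.setMapperClass(IdentityMapper.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}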

[1] http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/HadoopMemoryDefault_AMI2.html

mat kelcey
-1

My question is: should I handle multiprocessing in my mapper myself (read tasks from stdin, distribute them over worker processes, combine the results in a master process, and write to stdout), or will Hadoop take care of it automatically?

Once a Hadoop cluster has been set up, the minimum required to submit a job is:

  • Input format and location
  • Output format and location
  • Map and Reduce functions for processing the data
  • Location of the NameNode and the JobTracker

Hadoop will take care of distributing the job to different nodes, monitoring them, reading the data from the input and writing the data to the output. If the user had to do all of those tasks, there would be no point in using Hadoop.
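For example, a bare-bones Hadoop Streaming run only has to name the input, the output, and the map/reduce scripts; the NameNode and JobTracker addresses come from the cluster's site configuration (or the -fs/-jt generic options). Here mapper.py and reducer.py are placeholder scripts, and the exact jar path depends on your Hadoop version:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -input /user/me/input \
  -output /user/me/output \
  -mapper mapper.py \
  -reducer reducer.py \
  -file mapper.py \
  -file reducer.py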

I suggest going through the Hadoop documentation and a couple of tutorials.

Praveen Sripati
  • The question was not about distributing the job to different nodes but about running several jobs on one node. – lithuak Feb 03 '12 at 14:46