1

My question is: should I handle multiprocessing in my mapper myself (read tasks from stdin, distribute them over worker processes, combine the results in a master process, and write to stdout), or will Hadoop take care of it automatically?

I haven't found the answer in either the Hadoop Streaming documentation or the Amazon Elastic MapReduce FAQ.

lithuak

3 Answers

1

Hadoop has a notion of "slots". A slot is a place where a mapper process will run. You configure the number of slots per tasktracker node; that is the theoretical maximum number of map processes that will run in parallel per node. It can be less if there are not enough separate portions of the input data (called FileSplits).
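On a classic (MRv1) cluster those slot counts are normally set per tasktracker in mapred-site.xml; a minimal sketch, with purely illustrative values:

<!-- mapred-site.xml on each tasktracker node (example values) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>  <!-- map slots on this node -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>  <!-- reduce slots on this node -->
</property>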
Elastic MapReduce has its own estimate of how many slots to allocate, depending on the instance capabilities.
At the same time, I can imagine a scenario where your processing is more efficient when one data stream is processed by many cores. If your mapper has built-in multicore usage, you can reduce the number of slots. But this is usually not the case in typical Hadoop tasks.

David Gruzman
1

See the EMR doco [1] for the number of map/reduce tasks per instance type.

In addition to David's answer, you can also have Hadoop run multiple threads per map slot by setting...

conf.setMapRunnerClass(MultithreadedMapRunner.class);  

The default is 10 threads but it's tunable with

-D mapred.map.multithreadedrunner.threads=5

I often find this useful for custom high IO stuff.
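For reference, a minimal driver using the old org.apache.hadoop.mapred API might look like the sketch below. IdentityMapper and the argument paths are only placeholders; whatever mapper you plug in must be thread-safe, since its map() is called from several threads at once.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

public class MultithreadedMapJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MultithreadedMapJob.class);
        conf.setJobName("multithreaded-map-example");

        // Run each map task with a pool of threads instead of a single loop.
        conf.setMapRunnerClass(MultithreadedMapRunner.class);
        // Override the default of 10 threads per map task.
        conf.setInt("mapred.map.multithreadedrunner.threads", 5);

        // Placeholder mapper; with the default TextInputFormat it passes
        // (LongWritable offset, Text line) pairs straight through.
        conf.setMapperClass(IdentityMapper.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}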

[1] http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/HadoopMemoryDefault_AMI2.html

mat kelcey
-1

My question is: should I handle multiprocessing in my mapper myself (read tasks from stdin, distribute them over worker processes, combine the results in a master process, and write to stdout), or will Hadoop take care of it automatically?

Once a Hadoop cluster has been set up, the minimum required to submit a job is:

  • Input format and location
  • Output format and location
  • Map and Reduce functions for processing the data
  • Location of the NameNode and the JobTracker

Hadoop will take care of distributing the job to different nodes, monitoring them, reading the data from the input and writing the data to the output. If the user had to do all of those tasks, there would be no point in using Hadoop.
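For example, a bare-bones Hadoop Streaming run only has to name the input, the output, and the map/reduce scripts; the NameNode and JobTracker addresses come from the cluster's site configuration (or the -fs/-jt generic options). Here mapper.py and reducer.py are placeholder scripts, and the exact jar path depends on your Hadoop version:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -input /user/me/input \
  -output /user/me/output \
  -mapper mapper.py \
  -reducer reducer.py \
  -file mapper.py \
  -file reducer.py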

I suggest going through the Hadoop documentation and a couple of tutorials.

Praveen Sripati
  • The question was not about distributing the job to different nodes but about running several jobs on one node. – lithuak Feb 03 '12 at 14:46