My question is: should I handle multiprocessing in my mapper myself (read tasks from stdin, distribute them over worker processes, combine the results in a master process, and write them to stdout), or will Hadoop take care of it automatically?
Once a Hadoop cluster has been set up, the minimum required to submit a job is:
- Input format and location
- Output format and location
- Map and Reduce functions for processing the data (a minimal sketch follows this list)
- Location of the NameNode and the JobTracker
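
For concreteness, here is a minimal sketch of what the Map function could look like as a Hadoop Streaming script. Word count is used purely as an illustration, and the file name `mapper.py` and record format are assumptions, not something from the question:

```python
#!/usr/bin/env python3
# mapper.py -- minimal Hadoop Streaming mapper sketch (word count example).
# Hadoop feeds records from the input split to this script on stdin, one
# record per line, and collects whatever it writes to stdout as
# intermediate key/value pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # Emit "key<TAB>value"; a tab is the default streaming separator.
        print(f"{word}\t1")
```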
Hadoop will take care of distributing the job across the nodes, monitoring the tasks, reading the data from the input, and writing the results to the output. If the user had to do all of that, there would be no point in using Hadoop. In particular, Hadoop already runs many mapper tasks in parallel (typically one per input split), so there is no need to manage worker processes inside the mapper yourself.
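A matching reducer, again only a sketch: Hadoop sorts and groups the mapper output by key and streams it to the reducer on stdin, so the script just reads consecutive lines and never spawns worker processes of its own.

```python
#!/usr/bin/env python3
# reducer.py -- minimal Hadoop Streaming reducer sketch (word count example).
# Hadoop sorts the mapper output by key before piping it to this script,
# so all counts for a given word arrive on consecutive lines.
import sys

current_word, current_count = None, 0

for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Both scripts would then be passed to the streaming jar via the `-mapper` and `-reducer` options (along with `-input` and `-output` paths); Hadoop launches the mapper instances in parallel across the cluster, which is where the multiprocessing comes from.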
I suggest going through the Hadoop documentation and a couple of tutorials.