
My Hadoop version is 1.0.2. Now I want at most 10 map tasks running at the same time. I have found two variables related to this question.

a) mapred.job.map.capacity

but in my Hadoop version, this parameter seems to have been abandoned.

b) mapred.jobtracker.taskScheduler.maxRunningTasksPerJob (http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.collector/1.0.2/mapred-default.xml)

I set this variable as follows:

Configuration conf = new Configuration();
conf.set("date", date);
conf.set("mapred.job.queue.name", "hadoop");
conf.set("mapred.jobtracker.taskScheduler.maxRunningTasksPerJob", "10");

DistributedCache.createSymlink(conf);
Job job = new Job(conf, "ConstructApkDownload_" + date);
...

The problem is that it doesn't work: there are still more than 50 map tasks running as the job starts.

After looking through the Hadoop documentation, I can't find any other way to limit the number of concurrently running map tasks. I hope someone can help me. Thanks.

=====================

I have found the answer to this question; I share it here for others who may be interested.

Use the Fair Scheduler, with the configuration parameter maxMaps, to set a pool's maximum number of concurrent task slots in the allocation file (fair-scheduler.xml). Then, when you submit jobs, just set the job's queue to the corresponding pool.
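For reference, a minimal sketch of such an allocation file; the pool name "limited" is hypothetical, and the cap of 10 matches the question:

<?xml version="1.0"?>
<allocations>
  <!-- Jobs in this pool may hold at most 10 map slots at once -->
  <pool name="limited">
    <maxMaps>10</maxMaps>
  </pool>
</allocations>

When submitting, the job can then be pointed at that pool, e.g. with conf.set("mapred.fairscheduler.pool", "limited"), assuming the Fair Scheduler's default pool-name property has not been overridden.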

HaiWang
  • Why are you trying to do this? If the motivation is a fair distribution of resources on your cluster you should try using FairScheduler. – mohit6up Jan 17 '13 at 14:15
  • Because in the map phase I will read from an external data source, and I don't want too many connections open at the same time. – HaiWang Jan 17 '13 at 14:31
  • Can you download the data you want locally? You can then just send that data file along when you launch your job, and not have to worry about the mappers count. – mohit6up Jan 17 '13 at 15:32

5 Answers


You can set the value of mapred.jobtracker.maxtasks.per.job to something other than -1 (the default). This limits the number of simultaneous map or reduce tasks a job can employ.

This variable is described as:

The maximum number of tasks for a single job. A value of -1 indicates that there is no maximum.
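A sketch of how this might look in mapred-site.xml on the JobTracker; note this is a cluster-wide setting rather than a per-job one, and the value 10 simply mirrors the question:

<property>
  <name>mapred.jobtracker.maxtasks.per.job</name>
  <value>10</value>
</property>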

I think there were plans to add mapred.max.maps.per.node and mapred.max.reduces.per.node to job configs, but they never made it to release.

Dave
  • Deprecated in Hadoop 2.7.2, replaced with `mapreduce.jobtracker.maxtasks.perjob` ([ref](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/DeprecatedProperties.html)) – Todd Owen Feb 01 '17 at 06:40
  • Also, reading the description carefully, I'm not sure this is the limit on *simultaneous* tasks. It may actually be a limit on the total tasks. There is another property `mapreduce.jobtracker.taskscheduler.maxrunningtasks.perjob` described as "The maximum number of running tasks for a job before it gets preempted." – Todd Owen Feb 01 '17 at 06:42

If you are using Hadoop 2.7 or newer, you can use mapreduce.job.running.map.limit and mapreduce.job.running.reduce.limit to limit the number of concurrently running map and reduce tasks at the job level.
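A minimal driver sketch along the lines of the question's code, assuming Hadoop 2.7+ (the reduce limit of 5 is illustrative):

Configuration conf = new Configuration();
// Cap this job at 10 concurrently running map tasks (Hadoop 2.7+).
conf.set("mapreduce.job.running.map.limit", "10");
// Optionally cap concurrently running reduce tasks as well.
conf.set("mapreduce.job.running.reduce.limit", "5");
Job job = Job.getInstance(conf, "ConstructApkDownload_" + date);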

See the JIRA ticket that introduced the fix.

Joel

mapred.tasktracker.map.tasks.maximum is the property that restricts the number of map tasks that can run at a time. Configure it in your mapred-site.xml.
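For example, to cap each TaskTracker at 10 simultaneous map tasks (the value is illustrative; the TaskTrackers must be restarted for the change to take effect):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>10</value>
</property>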

Refer to section 2.7 in http://wiki.apache.org/hadoop/FAQ

Magham Ravi
  • I think this variable controls the number of map tasks that run "on one tasktracker", not "in one job". – HaiWang Jan 17 '13 at 16:29
  • This parameter's description: "The maximum number of map tasks that will be run simultaneously by a task tracker." – HaiWang Jan 17 '13 at 16:29
  • @HaiWang: From my reading of your original question, `mapred.tasktracker.map.tasks.maximum` solves the problem: it doesn't control the total number of mappers but the number of mappers that are run concurrently. Thus, it doesn't affect the logic or granularity of the job, but the rate at which resources are used. I had the same problem, and this parameter worked for me (easier than setting up a fair scheduler). – Jim Pivarski Jul 08 '13 at 19:11

The number of mappers fired is decided by the input split size: the size of the chunks into which the data is divided and handed to different mappers as it is read from HDFS. So, in order to control the number of mappers, we have to control the split size.

It can be controlled by setting the parameters mapred.min.split.size and mapred.max.split.size while configuring the MapReduce job. The values are to be set in bytes. So if we have a 20 GB (20480 MB) file and we want to fire 40 mappers, each split needs to be 20480 MB / 40 = 512 MB. The code for that would be:

conf.set("mapred.min.split.size", "536870912");
conf.set("mapred.max.split.size", "536870912");

where conf is an object of the org.apache.hadoop.conf.Configuration class.

aa8y

Read about job scheduling in Hadoop (for example, the Fair Scheduler). You can create a custom queue with the appropriate configuration and then assign your job to it. If you limit your custom queue's maximum map tasks to 10, then every job assigned to that queue will have at most 10 concurrent map tasks.

Amin Raeiszadeh