
I am a little confused about the terms "job scheduling" and "task scheduling" in Hadoop, which came up while I was reading about delay scheduling in the fair scheduler in this slide deck.

Please correct me if I am wrong in my following assumptions:

  1. The default (FIFO) scheduler, the Capacity Scheduler, and the Fair Scheduler only come into play at the job level, when the user submits multiple jobs. They play no role if there is only a single job in the system. These scheduling algorithms form the basis of "job scheduling".

  2. Each job can have multiple map and reduce tasks. How are those tasks assigned to each machine? How are tasks scheduled within a single job? What is the basis of "task scheduling"?

  • I am not sure what you are talking about. I opened the presentation you linked, and there is not a single mention of the term "job scheduling" or "task scheduling". I also took a look at the full paper; there is not a single mention of "task scheduling" and just one mention of "job scheduling", in which the authors explain how job scheduling works in Hadoop (version 1, not version 2). Please point to the specific sections of the delay scheduling paper or presentation that are confusing to you. – cabad Sep 30 '13 at 16:27
  • I didn't understand slides 6 and 7 where scheduled tasks of each job are shown. – GoT Sep 30 '13 at 20:31

1 Answer


In the case of the fair scheduler, when there is a single job running, that job uses the entire cluster. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time.
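To make that concrete, here is a minimal sketch of how the Fair Scheduler is enabled in Hadoop 1.x (MRv1); the allocation file path below is illustrative, not prescribed:

```xml
<!-- mapred-site.xml: tell the JobTracker to use the Fair Scheduler -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>

<!-- Optional: an allocation file defining pools and weights
     (the path here is illustrative) -->
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/etc/hadoop/conf/fair-scheduler.xml</value>
</property>
```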

Unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish in a reasonable time while not starving long jobs. It is also an easy way to share a cluster among multiple users. Fair sharing can also work with job priorities: the priorities are used as weights to determine the fraction of total compute time that each job gets.
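As an example of weights in practice, a fair-scheduler allocation file might look like the sketch below; the pool names and numbers are made up for illustration:

```xml
<?xml version="1.0"?>
<!-- fair-scheduler.xml: "production" gets roughly twice the share of
     "research" when both pools have tasks pending -->
<allocations>
  <pool name="production">
    <weight>2.0</weight>
    <minMaps>10</minMaps>      <!-- guaranteed minimum map slots -->
    <minReduces>5</minReduces> <!-- guaranteed minimum reduce slots -->
  </pool>
  <pool name="research">
    <weight>1.0</weight>
  </pool>
</allocations>
```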

The CapacityScheduler is designed to allow sharing a large cluster while giving each organization a minimum capacity guarantee. The central idea is that the available resources in the Hadoop Map-Reduce cluster are partitioned among multiple organizations who collectively fund the cluster based on computing needs. There is an added benefit that an organization can access any excess capacity not being used by others. This provides elasticity for the organizations in a cost-effective manner.
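A rough sketch of how that partitioning is expressed in Hadoop 1.x follows; the queue names and percentages are made up for illustration. The maximum-capacity property caps how far a queue can expand into capacity left idle by others:

```xml
<!-- mapred-site.xml: use the Capacity Scheduler and declare the queues -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>
<property>
  <name>mapred.queue.names</name>
  <value>production,research</value>
</property>

<!-- capacity-scheduler.xml: guaranteed share per queue, in percent of slots -->
<property>
  <name>mapred.capacity-scheduler.queue.production.capacity</name>
  <value>70</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.research.capacity</name>
  <value>30</value>
</property>

<!-- Cap how far "research" may grow into idle capacity (elasticity limit) -->
<property>
  <name>mapred.capacity-scheduler.queue.research.maximum-capacity</name>
  <value>50</value>
</property>
```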

SSaikia_JtheRocker
  • So can I assume that the job scheduler type doesn't play any role if there is only one job in the system? – GoT Sep 30 '13 at 19:33
  • If there is only one job in the system, how are tasks scheduled on different machines for that job? – GoT Sep 30 '13 at 19:34
  • In the case of the fair scheduler, the job (and, for that matter, its tasks) uses the entire capacity of the cluster, as mentioned above. – SSaikia_JtheRocker Sep 30 '13 at 19:35
  • In the case of the capacity scheduler, it's a little different. Please see the edit. – SSaikia_JtheRocker Sep 30 '13 at 19:44
  • Do you have anything more to discuss? – SSaikia_JtheRocker Sep 30 '13 at 19:45
  • Thanks, but I am still confused. I understand the various job schedulers, but they only come into play when there are multiple jobs, right? Am I wrong? In the case where I have only one job, how do the various job scheduling algorithms matter? Hence I asked how tasks are scheduled in such a case. – GoT Sep 30 '13 at 20:30
  • Yeah, they don't matter unless it's the capacity scheduler, where the use of extra slots for tasks by a single running job can be configured. See [mapred.capacity-scheduler.queue.<queue-name>.maximum-capacity](http://hadoop.apache.org/docs/stable/capacity_scheduler.html#Resource+allocation) – SSaikia_JtheRocker Sep 30 '13 at 21:18