
TL;DR: Is there any way to get SGE to round-robin between servers when scheduling jobs, instead of allocating all jobs to the same server whenever it can?

Details:

I have a large compute process that consists of many smaller jobs. I'm using SGE to distribute the work across multiple servers in a cluster.

The process requires a varying number of tasks at different points in time (technically, it is a DAG of jobs). Sometimes the number of parallel jobs is very large (~1 per CPU in the cluster), sometimes it is much smaller (~1 per server). The DAG is dynamic and not uniform, so it isn't easy to tell how many parallel jobs there are or will be at any given point.

The jobs use a lot of CPU but also do a non-trivial amount of I/O (especially at job startup and shutdown). They access a shared NFS server connected to all the compute servers. Each compute server has a narrower connection (10Gb/s), but the NFS server has several wide connections (40Gb/s) into the communication switch. I'm not sure what the bandwidth of the switch backbone is, but it is a monster, so it should be high.

For optimal performance, jobs should be scheduled across different servers when possible. That is, if I have 20 servers, each with 20 processors, submitting 20 jobs should run one job on each. Submitting 40 jobs should run 2 on each, etc. Submitting 400 jobs would saturate the whole cluster.

However, SGE is perversely intent on minimizing my I/O performance. Submitting 20 jobs schedules all of them on a single server, so they all fight over a single measly 10Gb/s network connection while 19 other machines, with a combined 190Gb/s of bandwidth, sit idle.

I can force SGE to execute each job on a different server in several ways (using resources, using special queues, using my parallel environment and specifying '-t 1-', etc.). However, this means I will only be able to run one job per server, period. When the DAG opens up and spawns many jobs, the jobs will stall waiting for a completely free server while 19 out of the 20 processors of each machine will stay idle.
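One of those force-one-job-per-server workarounds can be sketched with a per-host consumable resource. This is only an illustration; the complex name `exclusive_job` and host name `compute01` are hypothetical, not from my actual setup:

```shell
# Hypothetical consumable "exclusive_job" -- all names here are examples.
qconf -mc                       # add a line to the complex list:
#   exclusive_job  ej  INT  <=  YES  YES  0  0
qconf -me compute01             # on each exec host, add:
#   complex_values exclusive_job=1
qsub -l exclusive_job=1 job.sh  # at most one such job lands per host
```

Since each host advertises only one unit of the consumable, no host ever runs two of these jobs at once, which is exactly the all-or-nothing behavior described above.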

What I need is a way to tell SGE to assign each job to the next server that has an available slot, in round-robin order. Better still would be to assign each job to the least loaded server (maximal number of unused slots, or maximal fraction of unused slots, or minimal number of used slots, etc.). But a dead simple round-robin would do the trick.

This seems like a much more sensible strategy in general, compared to SGE's policy of running each job on the same server as the previous job, which is just about the worst possible strategy for my case.

I looked over SGE's configuration options but I couldn't find any way to modify the scheduling strategy. That said, SGE's documentation isn't exactly easy to navigate, so I could have easily missed something.

Does anyone know of any way to get SGE to change its scheduling strategy to round-robin or least-loaded or anything along these lines?

Thanks!

Oren Ben-Kiki
  • This may give you some hope at least. On our Univa Grid Engine cluster, my jobs are scattered all over the cluster seemingly randomly. I'm not using any qsub options to control the placement of the jobs on the compute nodes. However, I don't know the details of the cluster configuration -- I'm not a system admin. – Steve Mar 11 '19 at 12:35

2 Answers


Simply change allocation_rule to $round_robin for the SGE parallel environment (sge_pe file):

allocation_rule

     The allocation rule is interpreted by the scheduler thread
     and helps the scheduler to decide how to distribute parallel
     processes among the available machines. If, for instance, a
     parallel environment is built for shared memory applications
     only, all parallel processes have to be assigned to a single
     machine, no matter how many suitable machines are available.
     If, however, the parallel environment follows the distributed
     memory paradigm, an even distribution of processes among
     machines may be favorable.
     The current version of the scheduler only understands the
     following allocation rules:

<int>:    An integer number fixing the number of processes per
          host. If the number is 1, all processes have to reside
          on different hosts. If the special denominator $pe_slots
          is used, the full range of processes as specified with
          the qsub(1) -pe switch has to be allocated on a single
          host (no matter which value belonging to the range is
          finally chosen for the job to be allocated).

$fill_up: Starting from the best suitable host/queue, all
          available slots are allocated. Further hosts and queues
          are "filled up" as long as a job still requires slots
          for parallel tasks.

$round_robin:
          From all suitable hosts a single slot is allocated until
          all tasks requested by the parallel job are dispatched.
          If more tasks are requested than suitable hosts are
          found, allocation starts again from the first host. The
          allocation scheme walks through suitable hosts in a
          best-suitable-first order.

Source: http://gridscheduler.sourceforge.net/htmlman/htmlman5/sge_pe.html
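As a concrete sketch, the change is made by editing the PE definition with `qconf`. The PE name `mpi_pe` below is an example; substitute whatever PE your jobs request:

```shell
# Sketch, assuming a PE named "mpi_pe" -- substitute your site's PE name.
qconf -sp mpi_pe       # show the current PE definition
qconf -mp mpi_pe       # opens the definition in $EDITOR; change the line:
#   allocation_rule    $round_robin
# Jobs then request the PE at submission time, e.g. 4 slots:
qsub -pe mpi_pe 4 job.sh
```

With `$round_robin` in place, the scheduler hands out one slot per suitable host before wrapping around, instead of filling one host first.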

Vince
  • It never occurred to me this would be in a per-environment setting. In retrospect, this makes perfect sense. Many thanks! – Oren Ben-Kiki Mar 11 '19 at 16:32
  • Update: $round_robin doesn't work. When submitting a job array, it does a round-robin between the slots, instead of between the jobs. That is, the 1st job gets 1 slot on one machine, 1 slot on another machine, ... the 2nd job gets 1 slot on one machine, 1 slot on another machine, ... What is actually needed is a combination of $round_robin and $pe_slots, that is, each job's slots would be on one machine, and the round-robin would only apply between jobs and not between slots of a single job. You could say don't use a job array - except that SGE chokes if you submit too many jobs at once... – Oren Ben-Kiki Mar 12 '19 at 12:01
  • Looks like you may have hit a limit to what you can accomplish with SGE. Maybe the only solution is to implement a combination of jobs and task arrays, such as submitting one task array per node, specifying the target node using `-l host=$hostname`, and limiting the number of running slots on that node. I migrated to Torque/Maui a while ago, so my expertise in SGE is limited. – Vince Mar 12 '19 at 14:35
  • Thanks. I'm sort of making do with submitting single-threaded jobs to a non-parallel environment (which round-robins them between servers) and multi-threaded jobs to a parallel environment (which uses $pe_slots, so batches them on few servers). If each multi-threaded job takes over the whole server, things work fine. But often this isn't the case and I get sub-optimal allocations and increased run-times :-( It wouldn't be easy to get my IT people to switch the cluster software... – Oren Ben-Kiki Mar 12 '19 at 18:20

This looks like a job for sched_conf. You probably want to set queue_sort_method to load. Add an appropriate load_formula, possibly play with job_load_adjustments and load_adjustment_decay_time, and perhaps set load_thresholds on each queue.
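A sketch of what that looks like in practice; the attribute names come from sched_conf(5), but the particular values shown are illustrative assumptions, not recommendations:

```shell
# Sketch: attribute names are from sched_conf(5); values are examples only.
qconf -ssconf          # show the current scheduler configuration
qconf -msconf          # opens it in $EDITOR; set, e.g.:
#   queue_sort_method             load
#   load_formula                  np_load_avg
#   job_load_adjustments          np_load_avg=0.50
#   load_adjustment_decay_time    0:7:30
```

Sorting queues by load (with a load adjustment applied to freshly started jobs so they don't all pile onto the same host before its load average catches up) approximates the least-loaded placement the question asks for.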

William Hay