Enabling Univa Grid Engine Resource Reservation without a time limit on jobs

Question

My organization has a server cluster running Univa Grid Engine 8.4.1, with users submitting various kinds of jobs, some using a single CPU core, and some using OpenMPI to utilize multiple cores, all with varying and unpredictable run-times.

We've enabled a ticketing system so that one user can't hog the entire queue, but if the grid and queue are full of single-CPU jobs, no multi-CPU job can ever start (they just sit at the top of the queue waiting for the required number of CPU slots to become free, which generally never happens). We're looking to configure Resource Reservation such that, if the MPI job is next in the queue, the grid will hold slots open as they become free until there are enough to start the MPI job, rather than filling them with the single-CPU jobs that are further down in the queue.

I've read (here for example) that the grid makes the decision of which slots to "reserve" based on how much time is remaining on the jobs running in those slots. The problem we have is that our jobs have unknown run-times. Some take a few seconds, some take weeks, and while we have a rough idea how long a job will take, we can never be sure. Thus, we don't want to start running qsub with hard and soft time limits through -l h_rt and -l s_rt, or else our jobs could be killed prematurely. Resource Reservation appears to be using the default_duration, which we set to infinity for lack of a better number to use, and treating all jobs equally. It's picking slots filled by month-long jobs which have already been running for a few days, instead of slots filled by minute-long jobs which have only been running for a few seconds.

Is there a way to tell the scheduler to reserve slots for a multi-CPU MPI job as they become available, rather than pre-select slots based on some perceived run-time of the jobs in them?

I have exactly this problem. We are having to add lots of extra infrastructure to work around this. But why is it that SGE doesn't just hold off submitting single slot jobs until it has enough slots for the multi-slot job? — WestHamster, Jun 05 '20 at 12:08
That's because it picks which slots to reserve when the MPI job hits the top of the queue, but doesn't adjust that if other slots open up earlier. If all your jobs have the same run-time (maybe inherited from the queue configuration instead of the user), it'll just pick the oldest jobs. We sort-of got around this problem by using the "-l d_rt=HH:MM:SS" qsub option, which is the "estimated" run-time. A job won't be killed if it exceeds this time limit, but it seems to help resource reservation pick the best slots to hold open. Not perfect, but better than nothing. — Xirin, Jun 06 '20 at 15:14

score 0 · Accepted Answer · answered Mar 15 '17 at 19:43

Unfortunately, I'm not aware of a way to do what you ask. I think that the reservation is created once at the time that the job is submitted, not progressively as slots become free. If you haven't already seen the design document for the Resource Reservation feature, it's worth a look to get oriented.

Instead, I'm going to suggest some strategies for confidently setting job runtimes. The main problem when none of your jobs have runtimes is that Grid Engine can't reserve space infinitely far into the future, so even really rough runtimes (within an order of magnitude of the true runtime) may give you some positive results.

If you've run a similar job previously, one simple rule of thumb is to set max runtime to 150% of the typical or maximum runtime of the job, based on historical trends. Use qacct or parse the accounting file to get hard data. Of course, tweak that percentage to whatever suits your risk threshold.
Another rule of thumb is to set the max runtime based not on the job's true runtime, but on a cutoff such as "after this date, the results won't be useful" or "if it takes this long, something's definitely wrong". If you need an answer by Friday, there's no sense in setting the runtime limit for three months out. Similarly, if you're running md5sum on typically megabyte-sized files, there's no sense in setting a 1-day runtime limit; those jobs ought to take only a few seconds or minutes, and if one is really taking a long time, then something is broken.
If you really must allow true indefinite-length jobs, then one option is to divide your cluster into infinite and finite queues. Jobs specifying a finite runtime will be able to use both queues, while infinite jobs will have fewer resources available; this will incentivize users to work a little harder at picking runtimes, without forcing them to do so.
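
As a rough illustration of the 150% rule of thumb above, here is a minimal Python sketch. It assumes you have already pulled the wall-clock times (in seconds) of earlier runs out of qacct or the accounting file; the function name `suggest_h_rt` and the sample numbers are made up for the example, and the margin is yours to tune.

```python
import math

def suggest_h_rt(wallclock_seconds, margin=1.5):
    """Return a runtime limit (H:MM:SS) equal to margin x the longest past run.

    wallclock_seconds: wall-clock times, in seconds, of earlier runs of a
    similar job (how you extract them from the accounting data is up to you).
    """
    longest = max(wallclock_seconds)
    limit = math.ceil(longest * margin)      # round up to a whole second
    hours, rem = divmod(limit, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"

# Hypothetical history: three earlier runs of the same kind of job.
history = [3200, 4100, 3650]
print(suggest_h_rt(history))  # prints 1:42:30
```

The resulting string can be passed as -l h_rt=1:42:30 at submission. Using the maximum rather than the mean is the more conservative choice, since a limit that is too short gets the job killed.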

Finally, be sure that the multi-slot jobs are submitted with the -R y qsub flag to enable the resource reservation system. This could go in the system default sge_request file, but that's generally not recommended as it can reduce scheduling performance:

"Since reservation scheduling performance consumption is known to grow with the number of pending jobs, use of -R y option is recommended only for those jobs actually queuing for bottleneck resources."
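
If you submit from scripts, one way to keep -R y confined to the multi-slot jobs is to build the qsub command per job instead of putting the flag in a default request file. The sketch below is only an illustration: the helper name `reservation_qsub_args`, the parallel environment name "orte", and the script name "mpi_job.sh" are assumptions, so substitute whatever your site uses.

```python
def reservation_qsub_args(script, pe_name, slots, h_rt=None):
    """Build a qsub argument list for a multi-slot job with reservation on.

    pe_name is the site-specific parallel environment (assumed here);
    h_rt is an optional hard runtime limit such as "1:42:30".
    """
    args = ["qsub", "-R", "y", "-pe", pe_name, str(slots)]
    if h_rt is not None:
        args += ["-l", f"h_rt={h_rt}"]
    args.append(script)
    return args

# Hypothetical 16-slot MPI job with a rough runtime limit.
cmd = reservation_qsub_args("mpi_job.sh", "orte", 16, h_rt="1:42:30")
print(" ".join(cmd))  # prints qsub -R y -pe orte 16 -l h_rt=1:42:30 mpi_job.sh
```

The list can be handed to subprocess.run(cmd) on a submit host. Single-slot jobs would go through plain qsub without -R y, so the scheduler only pays the reservation cost for the jobs that need it.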

Thanks for the response. I had a feeling there might not be a way to do what I wanted, but I'm glad to get a second opinion. Thanks as well for referring me to that document! — Xirin, Mar 16 '17 at 17:36
