We have multiple GPU servers for deep learning training. We train our code inside Docker containers.
Today every user can log into these machines interactively and run a job. There is no restriction on job duration or on how many GPUs one can allocate. We haggle over GPUs via a chat channel.
What's the simplest framework for queuing jobs and allocating them to GPUs? I'm looking for GPU-level granularity, and am fine with multiple users running on the same machine.
The desired workflow (see the bash sketch after the list) is:
- Allocate GPU
- Create docker container
- Run the job (installing any extra requirements on top of the vanilla image if necessary). Could this just be a bash script?
- Finish job / kill job if running too long and close container
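
To make it concrete, here's a minimal bash sketch of the kind of thing I want automated, assuming the NVIDIA Container Toolkit is installed; `job.sh`, `my-training-image`, and `GPU_ID` are placeholders:

```bash
#!/usr/bin/env bash
# Sketch of the workflow above done by hand with plain Docker.
# Assumes the NVIDIA Container Toolkit; job.sh, my-training-image and GPU_ID are placeholders.
GPU_ID=0                 # 1. "allocate" a GPU (today: agreed on in the chat channel)
MAX_RUNTIME=24h          # hard limit after which the job is killed
NAME="train-${USER}-gpu${GPU_ID}"

# 2+3. create a container pinned to a single GPU and run the job inside it,
#      installing extra requirements on top of the vanilla image first.
docker run -d --rm --name "$NAME" \
  --gpus "device=${GPU_ID}" \
  -v "$PWD":/workspace -w /workspace \
  my-training-image \
  bash -c "pip install -r requirements.txt && bash job.sh"

# 4. watchdog: kill the container if it outlives MAX_RUNTIME (--rm then removes it).
#    This is exactly the ad-hoc watchdog approach I'd rather not maintain myself.
( sleep "$MAX_RUNTIME" && docker kill "$NAME" >/dev/null 2>&1 ) &
```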
Additionally, I'd like an interactive workflow similar to the above, but with a reduced time limit (say, a maximum of 2 hours of interactive time?).
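
Something like this, again just a sketch with placeholder names, is what I have in mind for the interactive case:

```bash
# Interactive session on one GPU, force-killed after 2 hours (placeholder image name).
GPU_ID=1
NAME="interactive-${USER}-gpu${GPU_ID}"

( sleep 2h && docker kill "$NAME" >/dev/null 2>&1 ) &   # 2-hour watchdog
docker run --rm -it --name "$NAME" \
  --gpus "device=${GPU_ID}" \
  -v "$PWD":/workspace -w /workspace \
  my-training-image bash
```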
I'd like to prevent someone from hogging a machine for a week, so a job should have a finite runtime and be killed afterward. Also, multiple people shouldn't be able to use the same GPU simultaneously.
I realize I could do this with a cron job that monitors and kills jobs, but I'm looking for something more elegant, ready-made, and preferably with a nice UI. I've tried ClearML, but I can't figure out how to use it for this purpose. I know SLURM is used to allocate entire machines; it's unclear to me whether it can allocate specific GPUs.
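
From skimming the SLURM docs, GPU-level allocation does seem to exist via GRES (generic resources), roughly as below, but I haven't verified how well it combines with per-job Docker containers, so corrections are welcome:

```bash
# gres.conf on a 4-GPU node (example values):
#   NodeName=gpu-node01 Name=gpu File=/dev/nvidia[0-3]
# slurm.conf additionally needs GresTypes=gpu and Gres=gpu:4 on that node's line.

# Batch job requesting exactly one GPU with a hard 24-hour limit:
sbatch --gres=gpu:1 --time=24:00:00 --wrap="bash job.sh"

# Interactive shell on one GPU, capped at 2 hours:
srun --gres=gpu:1 --time=02:00:00 --pty bash
```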