We have multiple GPU servers for deep learning training. We train our code inside Docker containers.
Today every user can log into these machines interactively and run a job. There is no restriction on job duration or on how many GPUs one can allocate. We haggle over GPUs via a chat channel.
What's the simplest framework for queuing jobs and allocating them to GPUs? I'm looking for GPU-level granularity, and am fine with multiple users running on the same machine.
The desired workflow (see the bash sketch after the list) is:
- Allocate GPU
- Create docker container
- Run the job (installing any extra requirements on top of the vanilla image if necessary). Could this just be a bash script?
- Finish job / kill job if running too long and close container
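
To make it concrete, here's a minimal bash sketch of the kind of thing I want automated, assuming the NVIDIA Container Toolkit is installed; `job.sh`, `my-training-image`, and `GPU_ID` are placeholders:

```bash
#!/usr/bin/env bash
# Sketch of the workflow above done by hand with plain Docker.
# Assumes the NVIDIA Container Toolkit; job.sh, my-training-image and GPU_ID are placeholders.
GPU_ID=0                 # 1. "allocate" a GPU (today: agreed on in the chat channel)
MAX_RUNTIME=24h          # hard limit after which the job is killed
NAME="train-${USER}-gpu${GPU_ID}"

# 2+3. create a container pinned to a single GPU and run the job inside it,
#      installing extra requirements on top of the vanilla image first.
docker run -d --rm --name "$NAME" \
  --gpus "device=${GPU_ID}" \
  -v "$PWD":/workspace -w /workspace \
  my-training-image \
  bash -c "pip install -r requirements.txt && bash job.sh"

# 4. watchdog: kill the container if it outlives MAX_RUNTIME (--rm then removes it).
#    This is exactly the ad-hoc watchdog approach I'd rather not maintain myself.
( sleep "$MAX_RUNTIME" && docker kill "$NAME" >/dev/null 2>&1 ) &
```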
Additionally, I'd like an interactive workflow similar to the above, but with a reduced time limit (say, a maximum of 2 hours of interactive time?).
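
Something like this, again just a sketch with placeholder names, is what I have in mind for the interactive case:

```bash
# Interactive session on one GPU, force-killed after 2 hours (placeholder image name).
GPU_ID=1
NAME="interactive-${USER}-gpu${GPU_ID}"

( sleep 2h && docker kill "$NAME" >/dev/null 2>&1 ) &   # 2-hour watchdog
docker run --rm -it --name "$NAME" \
  --gpus "device=${GPU_ID}" \
  -v "$PWD":/workspace -w /workspace \
  my-training-image bash
```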
I'd like to prevent someone from hogging a machine for a week, so a job should have a finite runtime and be killed afterward. Also, multiple people shouldn't be able to use the same GPU simultaneously.
I realize I could do this with a cron job that monitors and kills jobs, but I'm looking for something more elegant, ready-made, and preferably with a nice UI. I've tried ClearML, but I can't figure out how to use it for this purpose. I know SLURM is used to allocate entire machines; it's unclear to me whether it can allocate specific GPUs.
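
From skimming the SLURM docs, GPU-level allocation does seem to exist via GRES (generic resources), roughly as below, but I haven't verified how well it combines with per-job Docker containers, so corrections are welcome:

```bash
# gres.conf on a 4-GPU node (example values):
#   NodeName=gpu-node01 Name=gpu File=/dev/nvidia[0-3]
# slurm.conf additionally needs GresTypes=gpu and Gres=gpu:4 on that node's line.

# Batch job requesting exactly one GPU with a hard 24-hour limit:
sbatch --gres=gpu:1 --time=24:00:00 --wrap="bash job.sh"

# Interactive shell on one GPU, capped at 2 hours:
srun --gres=gpu:1 --time=02:00:00 --pty bash
```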