
We have multiple GPU servers for deep learning training. We train our code inside Docker containers.

Today every user can log into these machines interactively and run a job. There is no restriction on job duration or on how many GPUs one can allocate, and we haggle over GPUs via a chat channel.

What's the simplest framework for queuing jobs and allocating them to GPUs? I'm looking for GPU-level granularity, and am fine with multiple users running on the same machine.

The desired workflow is (a rough sketch follows the list):

  • Allocate GPU
  • Create docker container
  • Run the job (installing extra requirements on top of the vanilla container if needed); ideally the job is just a bash script
  • Finish the job, or kill it if it runs too long, and remove the container
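
To make the batch case concrete, here is a rough Python sketch of the per-machine worker loop I have in mind. The docker flags are real, but the queue itself and helpers like `next_job()` and `pick_free_gpu()` are placeholders for whatever framework I end up using:

```python
# A rough sketch, not a real tool: one worker per machine, plain docker CLI.
# next_job() / pick_free_gpu() are hypothetical helpers standing in for
# whatever queue we adopt.
import subprocess

MAX_SECONDS = 24 * 3600  # hard wall-clock limit per batch job


def run_on_gpu(gpu_id, image, command):
    """Run one job in a fresh container pinned to a single GPU, and
    force-remove the container if it exceeds the time limit."""
    name = f"job-gpu{gpu_id}"
    try:
        subprocess.run(
            ["docker", "run", "--rm", "--name", name,
             "--gpus", f"device={gpu_id}",    # GPU-level granularity
             image, "bash", "-c", command],   # the job is just a bash command/script
            timeout=MAX_SECONDS,
        )
    except subprocess.TimeoutExpired:
        # Timing out kills the local docker client, not the container,
        # so clean up explicitly.
        subprocess.run(["docker", "rm", "-f", name], check=False)


# e.g. run_on_gpu(0, "our-training-image:latest", "bash train.sh"),
# driven by something like: while True: run_on_gpu(pick_free_gpu(), *next_job())
```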

Additionally, I'd like an interactive workflow similar to the above, where the time limits are reduced (say, 2 hours max of interactive time?)

I'd like to prevent someone from hogging a machine for a week, so a job should have a finite runtime and be killed afterward. Also, multiple people shouldn't be able to use the same GPU simultaneously.
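
To clarify what I mean by exclusivity: even something as crude as a per-GPU lock file would do, along these lines (a sketch; the lock directory is made up):

```python
# Sketch of per-GPU exclusivity via plain file locks (paths are assumptions).
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def claim_gpu(gpu_id, lock_dir="/var/lock/gpu-queue"):
    """Hold an exclusive lock for one GPU; raises BlockingIOError if it is taken."""
    os.makedirs(lock_dir, exist_ok=True)
    fd = os.open(os.path.join(lock_dir, f"gpu{gpu_id}.lock"), os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # fail fast if someone holds it
        yield gpu_id
    finally:
        os.close(fd)  # closing the fd releases the lock


# with claim_gpu(2):
#     ...launch the container pinned to GPU 2...
```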

I realize I can do this with a cron job that monitors and kills jobs, but I'm looking for something more elegant, ready-made, and preferably with a nice UI. I've tried ClearML, but can't figure out how to use it for this purpose. I know SLURM is used for allocating entire machines; it's unclear to me whether it can allocate specific GPUs.
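
For reference, the cron-style watchdog I mean would be roughly the following, run every few minutes (the "gpu-job" container label is just an assumed convention for tagging job containers):

```python
# The crude cron-style watchdog I'd rather not maintain: kill any training
# container older than the limit. Assumes job containers carry a "gpu-job" label.
import subprocess
from datetime import datetime, timezone

MAX_SECONDS = 24 * 3600


def kill_overdue_containers():
    ids = subprocess.run(
        ["docker", "ps", "-q", "--filter", "label=gpu-job"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    for cid in ids:
        started = subprocess.run(
            ["docker", "inspect", "-f", "{{.State.StartedAt}}", cid],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        # StartedAt looks like 2024-05-01T12:34:56.789012345Z; trim to seconds.
        start = datetime.strptime(started[:19], "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
        age = (datetime.now(timezone.utc) - start).total_seconds()
        if age > MAX_SECONDS:
            subprocess.run(["docker", "rm", "-f", cid], check=False)


if __name__ == "__main__":
    kill_overdue_containers()
```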

Shahar
