Right now, I am using Horovod to run distributed training of my PyTorch models. I would like to start using Hydra configs for the --multirun feature and enqueue all jobs with SLURM. I know there is the Submitit plugin, but I am not sure how the whole pipeline would work with Horovod. My training command currently looks as follows:

CUDA_VISIBLE_DEVICES=2,3 horovodrun -np 2 python training_script.py \
--batch_size 30 \
...

Say I want to use Hydra's --multirun to run several multi-GPU experiments. I want to enqueue the runs with SLURM, since my resources are limited and the runs would execute sequentially most of the time, and I want to use Horovod to synchronize the gradients of my networks. Would this setup run out of the box? Would I need to specify CUDA_VISIBLE_DEVICES if SLURM took care of the resources? How would I need to adjust my run command or other settings to make this setup work? I am especially interested in how the multirun feature handles GPU resources. Any recommendations are welcome.

JAV

1 Answer


The Submitit plugin does support GPU allocation, but I am not familiar with Horovod and do not know whether the two can work together. One new feature of Hydra 1.0 is the ability to set or copy environment variables from the launching process. This might come in handy if Horovod needs certain environment variables to be set. See the docs for more information.
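
As a rough illustration of the environment variable feature, a config fragment along these lines should work. This is only a sketch: the file name config.yaml, the batch_size key, and both variable names are placeholders, not anything Horovod is known to require.

```yaml
# config.yaml (hypothetical primary config for training_script.py)
batch_size: 30

hydra:
  job:
    # Set fixed values in each job's environment.
    env_set:
      EXAMPLE_VAR: "1"      # placeholder name and value
    # Copy these variables from the launching process into each job.
    env_copy:
      - ANOTHER_VAR         # placeholder name
```

GPU requests would then go through the launcher rather than the job config: once the plugin is installed and selected with `hydra/launcher=submitit_slurm` on the command line, fields such as `hydra.launcher.gpus_per_node` can be overridden per run. Whether Horovod's own launcher cooperates with this is untested here.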

Omry Yadan
  • Thanks for the answer. OK, let's forget about Horovod for now. If I were to do my task manually, it would go as follows: I would enqueue each multi-GPU experiment with qsub, and each experiment would manage its GPU resources internally somehow. Is the --multirun and SLURM support in Hydra suitable for this scenario? I would simply like to end up with several multi-GPU processes enqueued without having to worry about GPU allocation. I have 4 GPUs and usually want to run each experiment on 2 GPUs or so. – JAV Sep 28 '20 at 16:48
  • I think it is appropriate for it, but this is more a question for Submitit. Go ahead and file a question on its GitHub. The devs are familiar with the Hydra plugin. – Omry Yadan Sep 28 '20 at 23:19
  • You may need to configure SLURM to support it. See https://slurm.schedmd.com/gres.html – Omry Yadan Sep 28 '20 at 23:20
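
To check whether SLURM is handing out GPUs as expected, one option is to log what each job actually sees before training starts. The sketch below assumes that a GRES-enabled SLURM setup exports CUDA_VISIBLE_DEVICES for the allocated GPUs, which is the usual behavior but depends on the cluster configuration; the function name describe_gpu_allocation is made up for this example.

```python
import os


def describe_gpu_allocation(env=None):
    """Summarize the GPU-related variables visible to this process.

    `env` defaults to os.environ; passing a plain dict makes the
    function easy to try out without a scheduler.
    """
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # Variable unset: nothing has restricted the visible GPUs.
        devices = None
    else:
        # Comma-separated device identifiers, e.g. "2,3".
        devices = [d for d in visible.split(",") if d]
    return {
        "slurm_job_id": env.get("SLURM_JOB_ID"),  # None outside SLURM
        "visible_devices": devices,
    }


if __name__ == "__main__":
    print(describe_gpu_allocation())
```

If visible_devices already lists the GPUs granted to the job, setting CUDA_VISIBLE_DEVICES by hand in the run command should not be necessary.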