Right now, I am using Horovod to run distributed training of my PyTorch models. I would like to start using Hydra for its --multirun feature and enqueue all jobs with Slurm. I know there is the Submitit plugin, but I am not sure how the whole pipeline would work with Horovod. My current training command looks as follows:
CUDA_VISIBLE_DEVICES=2,3 horovodrun -np 2 python training_script.py \
--batch_size 30 \
...
Say I want to use Hydra's --multirun to run several multi-GPU experiments. I want to enqueue the runs with Slurm, since my resources are limited and the runs would execute sequentially most of the time, and I want to use Horovod to synchronize the gradients of my networks. Would this setup work out of the box? Would I still need to specify CUDA_VISIBLE_DEVICES if Slurm takes care of the resources? How would I need to adjust my run command or other settings to make this setup feasible? I am especially interested in how the multirun feature handles GPU resources. Any recommendations are welcome.
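To make the question more concrete, here is a rough sketch of what I imagine the launch side could look like. It assumes that Slurm exports CUDA_VISIBLE_DEVICES for the GPUs it allocates to a job (which I believe depends on the cluster's GRES setup), and that the script takes Hydra-style overrides such as batch_size=30 instead of --batch_size 30. The helper names visible_gpu_count and build_horovod_command are hypothetical, not part of Hydra, Submitit, or Horovod.

```python
import os


def visible_gpu_count(env=None):
    """Count the GPU ids listed in CUDA_VISIBLE_DEVICES (0 if unset or empty)."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES", "")
    return len([dev for dev in value.split(",") if dev.strip()])


def build_horovod_command(script, overrides, env=None):
    """Build a horovodrun command with one process per visible GPU.

    Assumption: the scheduler has already restricted CUDA_VISIBLE_DEVICES,
    so it does not need to be set by hand on the command line.
    """
    n_gpus = visible_gpu_count(env)
    if n_gpus == 0:
        raise RuntimeError("No GPUs visible; was a GPU requested from the scheduler?")
    return ["horovodrun", "-np", str(n_gpus), "python", script, *overrides]


if __name__ == "__main__":
    # Stand-in for the environment a Slurm job with two GPUs might see.
    demo_env = {"CUDA_VISIBLE_DEVICES": "2,3"}
    print(" ".join(build_horovod_command("training_script.py", ["batch_size=30"], demo_env)))
```

The idea would be to derive -np from whatever the scheduler hands out rather than hard-coding it, but I do not know whether this fits with how the Submitit launcher starts each multirun job.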