There are two ways to allocate GPUs in Slurm: either the general --gres=gpu:N parameter, or specific parameters like --gpus-per-task=N. There are also two ways to launch MPI tasks in a batch script: either using srun, or using the usual mpirun (when OpenMPI is compiled with Slurm support). I found some surprising differences in behaviour between these methods.
I'm submitting a batch job with sbatch, where the basic script is the following:
#!/bin/bash
#SBATCH --job-name=sim_1 # job name (default is the name of this file)
#SBATCH --output=log.%x.job_%j # file name for stdout/stderr (%x will be replaced with the job name, %j with the jobid)
#SBATCH --time=1:00:00 # maximum wall time allocated for the job (D-H:MM:SS)
#SBATCH --partition=gpXY # put the job into the gpu partition
#SBATCH --exclusive # request exclusive allocation of resources
#SBATCH --mem=20G # RAM per node
#SBATCH --threads-per-core=1 # do not use hyperthreads (i.e. CPUs = physical cores below)
#SBATCH --cpus-per-task=4 # number of CPUs per process
## nodes allocation
#SBATCH --nodes=2 # number of nodes
#SBATCH --ntasks-per-node=2 # MPI processes per node
## GPU allocation - variant A
#SBATCH --gres=gpu:2 # number of GPUs per node (gres=gpu:N)
## GPU allocation - variant B
## #SBATCH --gpus-per-task=1 # number of GPUs per process
## #SBATCH --gpu-bind=single:1 # bind each process to its own GPU (single:<tasks_per_gpu>)
# start the job in the directory it was submitted from
cd "$SLURM_SUBMIT_DIR"
# program execution - variant 1
mpirun ./sim
# program execution - variant 2
#srun ./sim
The #SBATCH options in the first block are fairly standard and not particularly interesting. The behaviour I'll describe below is observable when the job runs on at least 2 nodes. I'm running 2 tasks per node since we have 2 GPUs per node. Finally, there are two variants of GPU allocation (A and B) and two variants of program execution (1 and 2), so there are 4 variants in total: A1, A2, B1, B2.
Variant A1 (--gres=gpu:2, mpirun)
Variant A2 (--gres=gpu:2, srun)
In both variants A1 and A2, the job executes correctly with optimal performance, and we get the following output in the log:
Rank 0: rank on node is 0, using GPU id 0 of 2, CUDA_VISIBLE_DEVICES=0,1
Rank 1: rank on node is 1, using GPU id 1 of 2, CUDA_VISIBLE_DEVICES=0,1
Rank 2: rank on node is 0, using GPU id 0 of 2, CUDA_VISIBLE_DEVICES=0,1
Rank 3: rank on node is 1, using GPU id 1 of 2, CUDA_VISIBLE_DEVICES=0,1
Variant B1 (--gpus-per-task=1, mpirun)
The job does not execute correctly: the GPUs are not mapped properly because CUDA_VISIBLE_DEVICES=0 on the second node, so both tasks on that node end up on the same GPU:
Rank 0: rank on node is 0, using GPU id 0 of 2, CUDA_VISIBLE_DEVICES=0,1
Rank 1: rank on node is 1, using GPU id 1 of 2, CUDA_VISIBLE_DEVICES=0,1
Rank 2: rank on node is 0, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=0
Rank 3: rank on node is 1, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=0
Note that this variant behaves the same with and without --gpu-bind=single:1.
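Incidentally, the CUDA_VISIBLE_DEVICES values can be inspected without the application at all, e.g. by temporarily replacing the program execution line with a plain shell command. This is only a diagnostic sketch: the SLURM_* variables below are set per task when srun is the launcher; with mpirun one would look at OMPI_COMM_WORLD_LOCAL_RANK instead.
# diagnostic only: print what each task sees instead of running ./sim
srun bash -c 'echo "host=$(hostname) rank=$SLURM_PROCID local_rank=$SLURM_LOCALID CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'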
Variant B2 (--gpus-per-task=1, --gpu-bind=single:1, srun)
GPUs are mapped correctly (each process now sees only one GPU because of --gpu-bind=single:1):
Rank 0: rank on node is 0, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=0
Rank 1: rank on node is 1, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=1
Rank 2: rank on node is 0, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=0
Rank 3: rank on node is 1, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=1
However, an MPI error appears when the ranks start to communicate (a similar message is repeated once for each rank):
--------------------------------------------------------------------------
The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
and will cause the program to abort.
Hostname: gp11
cuIpcOpenMemHandle return value: 217
address: 0x7f40ee000000
Check the cuda.h file for what the return value means. A possible cause
for this is not enough free device memory. Try to reduce the device
memory footprint of your application.
--------------------------------------------------------------------------
Although it says "This is an unrecoverable error", the execution seems to proceed just fine, except that the log is littered with messages like these (presumably one message per MPI communication call):
[gp11:122211] Failed to register remote memory, rc=-1
[gp11:122212] Failed to register remote memory, rc=-1
[gp12:62725] Failed to register remote memory, rc=-1
[gp12:62724] Failed to register remote memory, rc=-1
Clearly this is an OpenMPI error message. I found an old thread about this error, which suggested using --mca btl_smcuda_use_cuda_ipc 0 to disable CUDA IPC. However, since srun was used to launch the program in this case, I'm not sure how to pass such parameters to OpenMPI.
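My best guess is that MCA parameters can also be set through OMPI_MCA_* environment variables, so perhaps something like the following would work (untested sketch; I also don't know whether disabling CUDA IPC is the right fix or merely hides the real problem):
# untested: environment-variable form of "mpirun --mca btl_smcuda_use_cuda_ipc 0"
export OMPI_MCA_btl_smcuda_use_cuda_ipc=0
srun ./sim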
Note that in this variant --gpu-bind=single:1 affects only the visible GPUs (CUDA_VISIBLE_DEVICES). Even without this option, each task is still able to select the right GPU, and the errors still appear.
Any idea what is going on and how to address the errors in variants B1 and B2? Ideally we would like to use --gpus-per-task, which is more flexible than --gres=gpu:... (it's one less parameter to change when we change --ntasks-per-node; see the sketch below). Whether we use mpirun or srun does not matter to us.
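To make the flexibility argument concrete: on our 2-GPU nodes, switching to 1 task per node would require keeping two parameters in sync with variant A, but only one with variant B (sketch of the relevant #SBATCH lines only):
## variant A (--gres): two parameters must be kept in sync by hand
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1          # has to change whenever --ntasks-per-node changes
## variant B (--gpus-per-task): only the task count changes
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=1     # stays the same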
We have Slurm 20.11.5.1, OpenMPI 4.0.5 (built with --with-cuda and --with-slurm), and CUDA 11.2.2. The operating system is Arch Linux. The network is 10G Ethernet (no InfiniBand or OmniPath). Let me know if I should include more info.