There are two ways to allocate GPUs in Slurm: either the general --gres=gpu:N parameter, or specific parameters like --gpus-per-task=N. There are also two ways to launch MPI tasks in a batch script: either using srun, or using the usual mpirun (when OpenMPI is compiled with Slurm support). I found some surprising differences in behaviour between these methods.
I'm submitting a batch job with sbatch, where the basic script is the following:
#!/bin/bash
#SBATCH --job-name=sim_1 # job name (default is the name of this file)
#SBATCH --output=log.%x.job_%j # file name for stdout/stderr (%x will be replaced with the job name, %j with the jobid)
#SBATCH --time=1:00:00 # maximum wall time allocated for the job (D-H:MM:SS)
#SBATCH --partition=gpXY # put the job into the gpu partition
#SBATCH --exclusive # request exclusive allocation of resources
#SBATCH --mem=20G # RAM per node
#SBATCH --threads-per-core=1 # do not use hyperthreads (i.e. CPUs = physical cores below)
#SBATCH --cpus-per-task=4 # number of CPUs per process
## nodes allocation
#SBATCH --nodes=2 # number of nodes
#SBATCH --ntasks-per-node=2 # MPI processes per node
## GPU allocation - variant A
#SBATCH --gres=gpu:2 # number of GPUs per node (gres=gpu:N)
## GPU allocation - variant B
## #SBATCH --gpus-per-task=1 # number of GPUs per process
## #SBATCH --gpu-bind=single:1 # bind each process to its own GPU (single:<tasks_per_gpu>)
# start the job in the directory it was submitted from
cd "$SLURM_SUBMIT_DIR"
# program execution - variant 1
mpirun ./sim
# program execution - variant 2
#srun ./sim
The #SBATCH options in the first block are fairly standard and not particularly interesting. The behaviour I'll describe below is observable when the job runs on at least 2 nodes. I'm running 2 tasks per node since we have 2 GPUs per node. Finally, there are two variants of GPU allocation (A and B) and two variants of program execution (1 and 2), so there are 4 variants in total: A1, A2, B1, B2.
Variant A1 (--gres=gpu:2, mpirun)
Variant A2 (--gres=gpu:2, srun)
In both variants A1 and A2, the job executes correctly with optimal performance, and we get the following output in the log:
Rank 0: rank on node is 0, using GPU id 0 of 2, CUDA_VISIBLE_DEVICES=0,1
Rank 1: rank on node is 1, using GPU id 1 of 2, CUDA_VISIBLE_DEVICES=0,1
Rank 2: rank on node is 0, using GPU id 0 of 2, CUDA_VISIBLE_DEVICES=0,1
Rank 3: rank on node is 1, using GPU id 1 of 2, CUDA_VISIBLE_DEVICES=0,1
Variant B1 (--gpus-per-task=1, mpirun)
The job does not execute correctly: the GPUs are not mapped properly because CUDA_VISIBLE_DEVICES=0 on the second node, so both tasks on that node end up on the same GPU:
Rank 0: rank on node is 0, using GPU id 0 of 2, CUDA_VISIBLE_DEVICES=0,1
Rank 1: rank on node is 1, using GPU id 1 of 2, CUDA_VISIBLE_DEVICES=0,1
Rank 2: rank on node is 0, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=0
Rank 3: rank on node is 1, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=0
Note that this variant behaves the same with and without --gpu-bind=single:1.
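Incidentally, the CUDA_VISIBLE_DEVICES values can be inspected without the application at all, e.g. by temporarily replacing the program execution line with a plain shell command. This is only a diagnostic sketch: the SLURM_* variables below are set per task when srun is the launcher; with mpirun one would look at OMPI_COMM_WORLD_LOCAL_RANK instead.
# diagnostic only: print what each task sees instead of running ./sim
srun bash -c 'echo "host=$(hostname) rank=$SLURM_PROCID local_rank=$SLURM_LOCALID CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'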
Variant B2 (--gpus-per-task=1, --gpu-bind=single:1, srun)
GPUs are mapped correctly (each process now sees only one GPU because of --gpu-bind=single:1):
Rank 0: rank on node is 0, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=0
Rank 1: rank on node is 1, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=1
Rank 2: rank on node is 0, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=0
Rank 3: rank on node is 1, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=1
However, an MPI error appears when the ranks start to communicate (a similar message is repeated once for each rank):
--------------------------------------------------------------------------
The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
and will cause the program to abort.
Hostname: gp11
cuIpcOpenMemHandle return value: 217
address: 0x7f40ee000000
Check the cuda.h file for what the return value means. A possible cause
for this is not enough free device memory. Try to reduce the device
memory footprint of your application.
--------------------------------------------------------------------------
Although it says "This is an unrecoverable error", the execution seems to proceed just fine, except that the log is littered with messages like these (presumably one message per MPI communication call):
[gp11:122211] Failed to register remote memory, rc=-1
[gp11:122212] Failed to register remote memory, rc=-1
[gp12:62725] Failed to register remote memory, rc=-1
[gp12:62724] Failed to register remote memory, rc=-1
Clearly this is an OpenMPI error message. I found an old thread about this error, which suggested using --mca btl_smcuda_use_cuda_ipc 0 to disable CUDA IPC. However, since srun was used to launch the program in this case, I'm not sure how to pass such parameters to OpenMPI.
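My best guess is that MCA parameters can also be set through OMPI_MCA_* environment variables, so perhaps something like the following would work (untested sketch; I also don't know whether disabling CUDA IPC is the right fix or merely hides the real problem):
# untested: environment-variable form of "mpirun --mca btl_smcuda_use_cuda_ipc 0"
export OMPI_MCA_btl_smcuda_use_cuda_ipc=0
srun ./sim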
Note that in this variant --gpu-bind=single:1 affects only the visible GPUs (CUDA_VISIBLE_DEVICES). Even without this option, each task is still able to select the right GPU, and the errors still appear.
Any idea what is going on and how to address the errors in variants B1 and B2? Ideally we would like to use --gpus-per-task, which is more flexible than --gres=gpu:... (it's one less parameter to change when we change --ntasks-per-node; see the sketch below). Whether we use mpirun or srun does not matter to us.
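To make the flexibility argument concrete: on our 2-GPU nodes, switching to 1 task per node would require keeping two parameters in sync with variant A, but only one with variant B (sketch of the relevant #SBATCH lines only):
## variant A (--gres): two parameters must be kept in sync by hand
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1          # has to change whenever --ntasks-per-node changes
## variant B (--gpus-per-task): only the task count changes
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=1     # stays the same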
We have Slurm 20.11.5.1, OpenMPI 4.0.5 (built with --with-cuda and --with-slurm), and CUDA 11.2.2. The operating system is Arch Linux. The network is 10G Ethernet (no InfiniBand or OmniPath). Let me know if I should include more info.