Requesting 2 GPUs using SLURM and running 1 python script

Question

I am trying to allocate 2 GPUs and run 1 python script over these 2 GPUs. The python script requires the variables $AMBERHOME, which is obtained by sourcing the amber.sh script, and $CUDA_VISIBLE_DEVICES. The $CUDA_VISIBLE_DEVICES variable should equal something like 0,1 for the two GPUS I have requested.

Currently, I have been experimenting with this basic script.

#!/bin/bash
#
#BATCH --job-name=test
#SBATCH --output=slurm_info
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --time=5:00:00
#SBATCH --partition=gpu-v100

## Prepare Run
source /usr/local/amber20/amber.sh
export CUDA_VISIBLE_DEVICES=0,1

## Perform Run
python calculations.py

When I run the script, I can see that 2 GPUs are requested.

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
11111   GPU         test   jsmith CF       0:02     2     gpu-[1-2]

When I look at the output ('slurm_info') I see,

cpu-bind=MASK - gpu-1, task  0  0 [10111]: mask 0x1 set

and of course information about the failed job.

Typically when I run this script on my local workstation, I have 2 GPUs there and when entering nvidia-smi into the command line, I see...

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   29C    P0    24W / 300W |      0MiB / 16160MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   29C    P0    24W / 300W |      0MiB / 16160MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

However, when I use nvidia-smi with my previous batch script on the cluster I see the following.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   29C    P0    24W / 300W |      0MiB / 16160MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

This makes me think that when the python script runs it is only seeing the one GPU.

score 2 · Answer 1 · answered Jan 03 '22 at 08:33

You are requesting two nodes, not two GPUs. The correct syntax for requesting GPUs depends on the Slurm version and how your cluster is set up. But you generally use #SBATCH -G 2 to request two GPUs.

Slurm usually also takes care of CUDA_VISIBLE_DEVICES for you, so no need for that. Try this:

#!/bin/bash
#
#BATCH --job-name=test
#SBATCH --output=slurm_info
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2 #change that accordingly
#SBATCH -G 2
#SBATCH --time=5:00:00
#SBATCH --partition=gpu-v100

## Prepare Run
source /usr/local/amber20/amber.sh

## Perform Run
python calculations.py

Requesting 2 GPUs using SLURM and running 1 python script

1 Answers1