
I have a bash script, submit.sh, for submitting training jobs to a Slurm server. It works as follows: running

bash submit.sh p1 8 config_file

will submit some task corresponding to config_file to 8 GPUs of partition p1. Each node of p1 has 4 GPUs, thus this command requests 2 nodes.

The content of submit.sh can be summarized as follows, in which I use sbatch to submit a Slurm script (train.slurm):

#!/bin/bash
# submit.sh

PARTITION=$1
NGPUs=$2
CONFIG=$3

NGPUS_PER_NODE=4
NCPUS_PER_TASK=10

sbatch --partition ${PARTITION} \
    --job-name=${CONFIG} \
    --output=logs/${CONFIG}_%j.log \
    --ntasks=${NGPUs} \
    --ntasks-per-node=${NGPUS_PER_NODE} \
    --cpus-per-task=${NCPUS_PER_TASK} \
    --gres=gpu:${NGPUS_PER_NODE} \
    --hint=nomultithread \
    --time=10:00:00 \
    --export=CONFIG=${CONFIG},NGPUs=${NGPUs},NGPUS_PER_NODE=${NGPUS_PER_NODE} \
    train.slurm

Now in the Slurm script, train.slurm, I decide whether to launch the training Python script on one or multiple nodes (the ways to launch it are different in these two cases):

#!/bin/bash
# train.slurm
#SBATCH --distribution=block:block

# Load Python environment
module purge
module load pytorch/py3/1.6.0
 
set -x

if [ ${NGPUs} -gt ${NGPUS_PER_NODE} ]; then # Multi-node training
    # Some variables needed for the training script
    export MASTER_PORT=12340
    export WORLD_SIZE=${NGPUs}
    # etc.

    srun python train.py --cfg ${CONFIG}
else # Single-node training
    python -u -m torch.distributed.launch --nproc_per_node=${NGPUS_PER_NODE} --use_env train.py --cfg ${CONFIG}
fi

Now if I submit a single-node job (e.g., bash submit.sh p1 4 config_file), it works as expected. However, submitting a multi-node job (e.g., bash submit.sh p1 8 config_file) produces the following error:

slurmstepd: error: execve(): python: No such file or directory

This means that the Python environment was not recognized on one of the nodes. I tried replacing python with $(which python) to take the full path to the Python binary in the virtual environment, but then I obtained another error:

OSError: libmpi_cxx.so.40: cannot open shared object file: No such file or directory

If I don't use submit.sh but instead add all the #SBATCH directives to train.slurm and submit the job with sbatch directly from the command line, then it works. It therefore seems that wrapping sbatch inside a bash script causes the issue.

Could you please help me to resolve this?

Thank you so much in advance.

f10w

1 Answer


Beware that the --export parameter causes the environment seen by srun to be reset to exactly the SLURM_* variables plus the ones explicitly listed, in your case CONFIG, NGPUs, and NGPUS_PER_NODE. Consequently, the PATH variable is not set in that environment, and srun cannot find the python executable.

Note that --export does not alter the environment of the submission script itself, which is why the single-node case, which does not use srun, runs fine.
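You can see why a stripped-down environment breaks executable lookup with a small local experiment (illustrative only, not part of the original job; it emulates the reset by emptying PATH in a subshell):

```shell
#!/bin/sh
# Normal environment: the shell resolves command names through PATH.
command -v ls            # prints the full path, e.g. /bin/ls

# With PATH emptied (roughly what a restrictive --export list does to the
# environment that srun hands to the remote nodes), the same lookup fails:
PATH= /bin/sh -c 'command -v ls' || echo "ls: not found"
```

This is the same failure mode as `execve(): python: No such file or directory`: the binary exists on the node, but there is no PATH to find it with.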

Try submitting with

--export=ALL,CONFIG=${CONFIG},NGPUs=${NGPUs},NGPUS_PER_NODE=${NGPUS_PER_NODE} \

Note the added ALL as the first item in the list.
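For reference, here is how the full sbatch call in submit.sh would look with that fix applied (a sketch of your command, unchanged apart from the ALL entry):

```shell
sbatch --partition=${PARTITION} \
    --job-name=${CONFIG} \
    --output=logs/${CONFIG}_%j.log \
    --ntasks=${NGPUs} \
    --ntasks-per-node=${NGPUS_PER_NODE} \
    --cpus-per-task=${NCPUS_PER_TASK} \
    --gres=gpu:${NGPUS_PER_NODE} \
    --hint=nomultithread \
    --time=10:00:00 \
    --export=ALL,CONFIG=${CONFIG},NGPUs=${NGPUs},NGPUS_PER_NODE=${NGPUS_PER_NODE} \
    train.slurm
```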

Another option is to remove the --export line entirely and export the variables in submit.sh instead, since by default Slurm propagates the full submission environment to the job:

export PARTITION=$1
export NGPUs=$2
export CONFIG=$3

export NGPUS_PER_NODE=4
export NCPUS_PER_TASK=10
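With the variables exported this way, the exported names reach both the batch script and the srun steps, so the sbatch call simply drops the --export line (a sketch; all other options as in your original command):

```shell
sbatch --partition=${PARTITION} \
    --job-name=${CONFIG} \
    --output=logs/${CONFIG}_%j.log \
    --ntasks=${NGPUs} \
    --ntasks-per-node=${NGPUS_PER_NODE} \
    --cpus-per-task=${NCPUS_PER_TASK} \
    --gres=gpu:${NGPUS_PER_NODE} \
    --hint=nomultithread \
    --time=10:00:00 \
    train.slurm
```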
damienfrancois