I have a bash script `submit.sh` for submitting training jobs to a Slurm cluster. It works as follows: running `bash submit.sh p1 8 config_file` submits the task corresponding to `config_file` to 8 GPUs of partition `p1`. Each node of `p1` has 4 GPUs, so this command requests 2 nodes. The content of `submit.sh` can be summarized as follows; it uses `sbatch` to submit a Slurm script (`train.slurm`):
```bash
#!/bin/bash
# submit.sh
PARTITION=$1
NGPUs=$2
CONFIG=$3

NGPUS_PER_NODE=4
NCPUS_PER_TASK=10

sbatch --partition ${PARTITION} \
       --job-name=${CONFIG} \
       --output=logs/${CONFIG}_%j.log \
       --ntasks=${NGPUs} \
       --ntasks-per-node=${NGPUS_PER_NODE} \
       --cpus-per-task=${NCPUS_PER_TASK} \
       --gres=gpu:${NGPUS_PER_NODE} \
       --hint=nomultithread \
       --time=10:00:00 \
       --export=CONFIG=${CONFIG},NGPUs=${NGPUs},NGPUS_PER_NODE=${NGPUS_PER_NODE} \
       train.slurm
```
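To make things concrete, `bash submit.sh p1 8 config_file` therefore runs the following command (this is simply the script above with the arguments substituted):

```bash
sbatch --partition p1 \
       --job-name=config_file \
       --output=logs/config_file_%j.log \
       --ntasks=8 \
       --ntasks-per-node=4 \
       --cpus-per-task=10 \
       --gres=gpu:4 \
       --hint=nomultithread \
       --time=10:00:00 \
       --export=CONFIG=config_file,NGPUs=8,NGPUS_PER_NODE=4 \
       train.slurm
```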
Now in the Slurm script, `train.slurm`, I decide whether to launch the training Python script on one node or on several (the launch commands differ between the two cases):
```bash
#!/bin/bash
# train.slurm
#SBATCH --distribution=block:block

# Load the Python environment
module purge
module load pytorch/py3/1.6.0

set -x
if [ ${NGPUs} -gt ${NGPUS_PER_NODE} ]; then  # Multi-node training
    # Some variables needed by the training script
    export MASTER_PORT=12340
    export WORLD_SIZE=${NGPUs}
    # etc.
    srun python train.py --cfg ${CONFIG}
else  # Single-node training
    python -u -m torch.distributed.launch --nproc_per_node=${NGPUS_PER_NODE} --use_env train.py --cfg ${CONFIG}
fi
```
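(To see what the environment looks like on each node, here is a minimal diagnostic sketch that could be added near the top of `train.slurm`; it is not part of my actual script, just an illustration of how one could check where `python` resolves on every allocated task:)

```bash
# Diagnostic sketch (not in my actual script): for every task in the
# allocation, print the host it runs on and the python binary it resolves.
srun bash -c 'echo "$(hostname): $(which python)"'
```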
Now if I submit to a single node (e.g., `bash submit.sh p1 4 config_file`), the job runs as expected. However, submitting to multiple nodes (e.g., `bash submit.sh p1 8 config_file`) produces the following error:

```
slurmstepd: error: execve(): python: No such file or directory
```
This means that the Python environment is not found on one of the nodes. I tried replacing `python` with `$(which python)` to use the full path to the Python binary of the loaded environment (see the sketch below), but then I obtained another error:

```
OSError: libmpi_cxx.so.40: cannot open shared object file: No such file or directory
```
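Concretely, the multi-node branch of `train.slurm` then looked like this (only the `srun` line changed):

```bash
# What I tried: launch with the full path to the python binary.
srun $(which python) train.py --cfg ${CONFIG}
```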
If I don't use `submit.sh` but instead put all of the `#SBATCH` options into `train.slurm` itself and submit the job with `sbatch` directly from the command line, then everything works (a sketch of this variant is shown below). It therefore seems that wrapping `sbatch` inside a bash script causes the issue.
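For reference, here is a sketch of the variant that does work, with the submission options turned into `#SBATCH` directives at the top of `train.slurm` (values hard-coded for the 8-GPU case; the rest of the script is unchanged, and I am omitting how `CONFIG` and the other variables are set in this variant):

```bash
#!/bin/bash
# train.slurm, direct-submission variant (sketch); submitted with:
#   sbatch train.slurm
#SBATCH --partition=p1
#SBATCH --job-name=config_file
#SBATCH --output=logs/config_file_%j.log
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=10
#SBATCH --gres=gpu:4
#SBATCH --hint=nomultithread
#SBATCH --time=10:00:00
#SBATCH --distribution=block:block

# ... rest identical to train.slurm above ...
```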
Could you please help me to resolve this?
Thank you so much in advance.