I am trying to run multiple srun job steps within a single sbatch script on a cluster. The sbatch script is as follows:
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 64
#SBATCH --time=200:00:00
#SBATCH -p amd_256
for i in {0..6}
do
    cd ${i}
    ( srun -c 8 ./MD 150 20 300 20 20 0 0 > log.out 2>&1 & )
    sleep 20
    cd ..
done
cd 7/
srun -c 8 ./MD 100 20 300 20 20 0 0 > log.out 2>&1
cd ..
wait
The script launches eight srun job steps: steps 0-6 in the background, one per directory, and step 7 in the foreground. The problem is that steps 0-6 are killed as soon as step 7 finishes. Here is the error message I got for each of steps 0-6:
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
slurmstepd: error: *** STEP 3801214.0 ON j2308 CANCELLED AT 2021-12-22T11:02:22 ***
srun: error: j2308: task 0: Terminated
Any idea on how to fix this?
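
One possible cause, sketched here without Slurm: in bash, `wait` only waits for background jobs started by the current shell. A job launched as `( cmd & )` belongs to a short-lived subshell, so the final `wait` may return immediately, the batch script exits, and Slurm cancels the steps that are still running. In the sketch below, `sleep` is a stand-in for `srun ... ./MD`, and the temporary directory and marker file names are hypothetical.

```shell
#!/bin/bash
# Stand-in demo: `sleep` replaces `srun ... ./MD`; no Slurm is involved.
tmp=$(mktemp -d)

# Case 1: background job started inside a subshell, as in the script above.
( { sleep 1; touch "$tmp/subshell_done"; } & )
wait   # returns at once: the job is not a child of this shell
if [ -e "$tmp/subshell_done" ]; then subshell_waited=yes; else subshell_waited=no; fi

# Case 2: background job started directly by this shell (no parentheses).
{ sleep 2; touch "$tmp/direct_done"; } &
wait   # blocks until the job above has finished
if [ -e "$tmp/direct_done" ]; then direct_waited=yes; else direct_waited=no; fi

echo "subshell job: did wait block until it finished? $subshell_waited"
echo "direct job:   did wait block until it finished? $direct_waited"

rm -rf "$tmp"
```

If this is what happens in the batch script, writing `srun ... > log.out 2>&1 &` without the surrounding parentheses should let the final `wait` block until steps 0-6 have finished. That is an assumption based on shell behavior and has not been tested on this cluster.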