I was trying to run multiple srun jobs within a single sbatch script on a cluster. The sbatch script is as follows:

#!/bin/bash  
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 64
#SBATCH --time=200:00:00
#SBATCH -p amd_256
for i in {0..6} ;
do
  cd ${i}
  ( srun -c 8 ./MD 150 20 300 20 20 0 0 > log.out 2>&1 & )
  sleep 20
  cd ..
done
cd 7/
srun -c 8 ./MD 100 20 300 20 20 0 0 > log.out 2>&1 
cd ..

wait

This script launches several srun job steps. The problem is that steps 0 through 6 are killed as soon as step 7 finishes. Here is the error message I got for steps 0 through 6:

srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
slurmstepd: error: *** STEP 3801214.0 ON j2308 CANCELLED AT 2021-12-22T11:02:22 ***
srun: error: j2308: task 0: Terminated

Any idea on how to fix this?

andy90

1 Answer

The line

( srun -c 8 ./MD 150 20 300 20 20 0 0 > log.out 2>&1 & )

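creates a subshell and puts srun into the background inside that subshell. The subshell exits right away, and the wait call on the last line doesn't know about those background processes, because they are children of a different shell process rather than of the batch script itself. So once the foreground srun for job 7 returns, wait has nothing to wait for, the batch script finishes, and Slurm terminates the job along with any steps that are still running.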

Try this:

( srun -c 8 ./MD 150 20 300 20 20 0 0 > log.out 2>&1 ) &
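To show the same pattern applied to the whole loop, here is a minimal sketch that runs without Slurm. The function run_step is a hypothetical stand-in for the srun line from the question, and the scratch directory from mktemp replaces the 0..7 directories; both are assumptions made only so the example is self-contained. Putting the cd inside the subshell also removes the need for the cd .. afterwards.

```shell
#!/bin/bash
# run_step is a hypothetical stand-in for:
#   srun -c 8 ./MD 150 20 300 20 20 0 0
run_step() { sleep 1; echo "step finished in $(pwd)"; }

workdir=$(mktemp -d)   # scratch area standing in for the directories 0..7
pids=()

for i in {0..7}; do
  mkdir -p "$workdir/$i"
  # The & is outside the parentheses, so each subshell is a background
  # child of this script and is visible to wait.
  ( cd "$workdir/$i" && run_step > log.out 2>&1 ) &
  pids+=("$!")
done

# Wait for every child individually and count the ones that failed.
failed=0
for pid in "${pids[@]}"; do
  wait "$pid" || failed=$((failed + 1))
done

finished=$(cat "$workdir"/*/log.out | grep -c "step finished")
echo "finished=$finished failed=$failed"
```

Waiting on each PID is optional; a bare wait would also block until all children exit, but it would not report which ones failed.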

As an example, try

( sleep 60 & )
wait

and

( sleep 60 ) &
wait

to see the difference.
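The comparison can also be timed in a single script. This is a sketch that assumes bash, using its built-in SECONDS counter, and shortens the sleep to 3 seconds so it finishes quickly; the variable names inner_elapsed and outer_elapsed are only illustrative.

```shell
#!/bin/bash
# Case 1: the & is inside the subshell. The subshell exits at once and
# sleep is not a child of this script, so wait has nothing to wait for.
SECONDS=0
( sleep 3 & )
wait
inner_elapsed=$SECONDS

# Case 2: the & is outside the subshell. The subshell itself is a
# background child of this script, so wait blocks until it exits.
SECONDS=0
( sleep 3 ) &
wait
outer_elapsed=$SECONDS

echo "& inside:  wait returned after ${inner_elapsed}s"
echo "& outside: wait returned after ${outer_elapsed}s"
```

The first wait should return almost immediately, while the second should take roughly as long as the sleep.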

Marcus Boden