I have a Python script that I run on an HPC cluster. A launcher shell script takes a text file listing the input files and submits one sbatch job per file:

./launch_job.sh 0_folder_file_list.txt

launch_job.sh goes through 0_folder_file_list.txt and submits an sbatch job for each file:

# Keep the part of each file name before the first dot
SAMPLE_LIST=$(cut -d "." -f 1 "$1")

for SAMPLE in $SAMPLE_LIST
do
  echo "Getting accessions from $SAMPLE"
  sbatch get_acc.slurm $SAMPLE
  #./get_job.slurm $SAMPLE
done

get_job.slurm contains all of my #SBATCH directives, module loads, etc., and runs:

srun --mpi=pmi2 -n 5 python python_script.py ${SAMPLE}.txt

I don't want to start all of the jobs at once. I would like them to run consecutively, each with a 24-hour maximum run time. I have already set #SBATCH -t to allow for a maximum time, but I want each job to run for at most 24 hours. Is there an srun argument I can set that will accomplish this? Something else?

Damian

1 Answer

You can use the --wait flag with sbatch. The sbatch man page describes it as follows:

-W, --wait Do not exit until the submitted job terminates. The exit code of the sbatch command will be the same as the exit code of the submitted job. If the job terminated due to a signal rather than a normal exit, the exit code will be set to 1. In the case of a job array, the exit code recorded will be the highest value for any task in the job array.

In your case,

for SAMPLE in $SAMPLE_LIST
do
  echo "Getting accessions from $SAMPLE"
  sbatch --wait get_acc.slurm $SAMPLE
done

So each sbatch command is only called after the previous one returns, that is, after the previous job has ended or reached its time limit.
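
For reference, the whole launcher could look roughly like the sketch below. It assumes the list file holds one file name per line, that sample names contain no whitespace, and that get_acc.slurm takes the sample name as its first argument, as in the question. The function name submit_sequentially is hypothetical and only there to keep the loop in one place.

```shell
#!/usr/bin/env bash
# Sketch only: submit one job per sample and wait for each job to end
# before submitting the next. submit_sequentially is a hypothetical name.

submit_sequentially() {
  list_file=$1
  # Keep the part of each file name before the first dot, as in the question.
  # Word splitting is intentional here, so names must not contain whitespace.
  for sample in $(cut -d "." -f 1 "$list_file"); do
    echo "Getting accessions from $sample"
    # --wait makes sbatch block until the job ends or hits its time limit.
    # A failed job gives a non-zero exit code, but the loop still moves on.
    sbatch --wait get_acc.slurm "$sample"
  done
}

# Usage inside launch_job.sh:
#   submit_sequentially "$1"
```

Because sbatch --wait blocks, the launcher itself has to stay alive for the whole sequence, so it may be worth starting it under nohup or inside a tmux or screen session.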

j23
  • Perfect. Thank you. Just so I'm perfectly clear: I can set #SBATCH -t 23:59:59 in my .slurm file to get my desired 24-hour timeout and use --wait in my .sh file, and the next job won't start until the previous one finishes or times out? – Damian Mar 02 '22 at 16:10
  • Yes, indeed. :) If it is behaving as you intended, you can accept the answer then :D – j23 Mar 02 '22 at 16:19
  • Thank you. It is working as intended. Cheers – Damian Mar 03 '22 at 17:13
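
If keeping the launcher alive for days is inconvenient, a possible alternative to --wait is to submit everything up front and chain the jobs with Slurm job dependencies. The sketch below is an assumption-laden variant of the loop above, not something tested in this thread: submit_chained is a hypothetical name, and it relies on sbatch --parsable printing the job ID (optionally followed by ";cluster") and on --dependency=afterany starting a job once the named job has ended in any state.

```shell
#!/usr/bin/env bash
# Sketch only: submit all jobs at once, each one held until the previous
# job has ended. submit_chained is a hypothetical name.

submit_chained() {
  list_file=$1
  prev_id=""
  for sample in $(cut -d "." -f 1 "$list_file"); do
    if [ -z "$prev_id" ]; then
      # First job: no dependency.
      job_id=$(sbatch --parsable get_acc.slurm "$sample") || return 1
    else
      # afterany: eligible once the previous job ended, whatever its exit state.
      job_id=$(sbatch --parsable --dependency="afterany:$prev_id" \
        get_acc.slurm "$sample") || return 1
    fi
    # --parsable may print "jobid;cluster"; keep only the job ID.
    prev_id=${job_id%%;*}
    echo "Submitted $sample as job $prev_id"
  done
}

# Usage inside launch_job.sh:
#   submit_chained "$1"
```

With this variant the launcher returns immediately and the scheduler enforces the ordering; the per-job limit still comes from #SBATCH -t in the .slurm file.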