
I have a BASH script that submits multiple serial jobs to the PBS queueing system. Once the jobs are submitted the script ends. The jobs then run on a cluster and when they are all finished I can move on to the next step. A typical workflow might involve several of these steps.

My question:

Is there a way for my script not to exit upon completion of the submission, but rather to sleep until ALL jobs submitted by that script have completed on the cluster, only then exiting?

Jack Walpole
  • So far I have been waiting for steps before starting the next script manually. I can think of a clunky way to do it by checking for the existence of all the job.o/job.e files dumped upon completion of the job and counting them up. I haven't tried it yet though. – Jack Walpole Sep 04 '14 at 20:38
  • Add `wait` to the end of your script perhaps? (see `man bash` and `wait`) – David C. Rankin Sep 04 '14 at 20:49
  • @DavidC.Rankin: I don't think `wait` is the answer. These are submitted batch jobs, not child processes of the shell. – Keith Thompson Sep 04 '14 at 21:34
  • You could write a script that repeatedly calls `qstat` (preferably with some delay so you don't overload the system) and terminates when all the specified jobs have finished. It's ugly, but it should work. If nobody comes up with a better solution, I might post this answer. – Keith Thompson Sep 04 '14 at 21:40
  • I was thinking about that, Keith. I think I prefer a script that checks for the number of job.o files. That way you don't need to know the job submission numbers, nor do you have to keep calling qstat to check whether those jobs are still there. You just need to know how many jobs you've submitted, and that that number equals the number of *.o* files in a known directory (a sketch of the idea follows these comments). Incidentally, I've written something that does check for the number of job.o files. I could post it?? -- not really sure on the etiquette here (this is my first post) – Jack Walpole Sep 04 '14 at 22:24
  • You should be able to get the job IDs from the output of `qsub`, but yes, you'll have to keep track of them. Counting the `.o` files sounds reasonable; just watch out for (a) `.o` files left over from previous runs, and (b) `.o` files that aren't job output (`.o` is the standard suffix for object files). If your script isn't too big, you could post it as part of your question (if you're asking for improvements), or you could post it as an answer (yes, you can answer your own question). Checking `qstat` is probably a more general solution. – Keith Thompson Sep 04 '14 at 23:29
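
A minimal sketch of the .o-file-counting idea from these comments (assuming all jobs write their output files to one known directory that holds no leftover .o files, and using placeholder names):

# Poll until one job output file exists per submitted job
numJobs=3                    # however many jobs were submitted
outDir=./job_output          # known directory, cleared before submission
while [ "$(ls "$outDir"/*.o* 2>/dev/null | wc -l)" -lt "$numJobs" ]; do
    sleep 10                 # poll gently rather than hammering the filesystem
done
echo "all $numJobs jobs have written their output files"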

3 Answers


You are trying to establish a workflow, correct? The best way to do this is with job dependencies. Essentially, you submit X jobs, and then submit more jobs that depend on the first set; the dependent jobs won't start until the first set has finished. There are different kinds of dependencies that you can read about in the previous link, but here's an example of submitting 3 jobs and then submitting 3 more that won't execute until after the first 3 have exited.

#first batch
jobid1=$(qsub ...)
jobid2=$(qsub ...)
jobid3=$(qsub ...)

#next batch: each of these waits until all three jobs above have exited
depend_str="-W depend=afterany:${jobid1}:${jobid2}:${jobid3}"
qsub ... $depend_str
qsub ... $depend_str
qsub ... $depend_str
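
If the first batch doesn't have a fixed size, the same dependency string can be built in a loop. A minimal sketch, assuming a Torque/PBS `qsub` that prints the job ID on stdout and using placeholder job script names:

#!/bin/bash
# First batch: placeholder job script names, substitute your own
first_batch="job1.qsub job2.qsub job3.qsub"

# Submit each job and append its ID to a colon-separated list
ids=""
for job in $first_batch; do
    ids="${ids}:$(qsub "$job")"
done

# Second batch: afterany fires once all listed jobs have exited,
# successfully or not; use afterok if they must succeed first
qsub -W "depend=afterany${ids}" next_step.qsub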
dbeer

One way to do this would be to use the GNU Parallel command `sem`.

I learnt about this while doing queueing work as well. `sem` acts as a counting semaphore: each command started through it runs in the background, and `sem --wait` blocks until all of them have finished.

Edit: I know the example here is really basic, but a lot can be achieved by running tasks with `parallel --semaphore` (which is what `sem` is shorthand for), or even plain `parallel`. Have a look at the tutorial; I'm certain you will find a relevant example that helps.

There is a great tutorial here

An example from the tutorial:

  sem 'sleep 1; echo The first finished' &&
    echo The first is now running in the background &&
    sem 'sleep 1; echo The second finished' &&
    echo The second is now running in the background
  sem --wait

Output:

  The first is now running in the background
  The first finished
  The second is now running in the background
  The second finished

See the `sem` man page.
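
To map this onto the pattern in the question (start everything, then block until it has all finished), a minimal sketch, assuming GNU Parallel is installed and with `sleep` standing in for real work:

# Start 8 tasks with at most 4 running at once (-j 4), then block
# until every one of them has finished before continuing
for i in $(seq 1 8); do
    sem -j 4 "sleep 2; echo task $i done"
done
sem --wait
echo "all tasks complete"

Note that `sem` waits for the local commands it starts, so with PBS this only helps if the command run under `sem` itself blocks until the cluster job finishes.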

264nm

To actually check whether a job is done, we need to use `qstat` with the job ID to get the job's status, and then grep that status line for a status code. As long as neither your username nor your job name is "C", the following should work:

#!/bin/bash

# SECTION 1: Launch all jobs and store their job IDs in a variable

myJobs="job1.qsub job2.qsub job3.qsub" # Your job names here
numJobs=$(echo "$myJobs" | wc -w)      # Count the jobs
myJobIDs=""                            # Initialize an empty list of job IDs
for job in $myJobs; do
    jobID_full=$(qsub "$job")
    # jobID_full will look like "12345.machinename", so use sed
    # to get just the numbers
    jobID=$(echo "$jobID_full" | sed -e 's|\([0-9]*\).*|\1|')
    myJobIDs="$myJobIDs $jobID"        # Add this job ID to our list
done

# SECTION 2: Check the status of each job, and exit while loop only
# if they are all complete

numDone=0                              # Initialize so that loop starts
while [ $numDone -lt $numJobs ]; do    # Loop until every job is done
    numDone=0                          # Re-count from zero on each pass
    for jobID in $myJobIDs; do         # Loop through each job ID

        # The following check ONLY works if qstat won't return the
        # string ' C ' (a C surrounded by two spaces) in any situation
        # besides a completed job, i.e. if your username or job name
        # is 'C' then this won't work!
        # Could add a check for error (grep -q ' E ') too if desired.
        # Note: some clusters purge finished jobs from qstat entirely,
        # so an empty qstat result is also counted as complete here.
        status=$(qstat "$jobID" 2>/dev/null)
        if [ -z "$status" ] || echo "$status" | grep -q ' C '
        then
            (( numDone++ ))
        fi
    done
    if [ $numDone -lt $numJobs ]; then
        echo "$numDone jobs completed out of $numJobs"
        sleep 1                        # Poll once per pass, not per job
    fi
done

echo all jobs complete
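
Since this script only exits once every submitted job has finished, the multi-step workflow from the question can be driven by chaining one such script per step (hypothetical script names):

# Each step submits its jobs and exits only when they are all done,
# so the next step starts exactly when the previous one finishes
./step1_submit_and_wait.sh && ./step2_submit_and_wait.sh && ./step3_submit_and_wait.sh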
MountainDrew