
I'm using a cluster with the Torque/Maui system. I have a bash script that submits one job with the qsub command and afterwards does several things: moves files, writes ASCII files, and checks the output of the job it submitted. Concerning this output: if it contains the number 1, the job needs to be submitted again; if it is different from 1, the bash script does something else.

The problem is that qsub returns immediately and the job runs in the background, so the whole script is evaluated at once. I'd like to force qsub to behave much like awk, cat, sort, etc., where the script only continues after the command finishes, as long as it is not put in the background.

So I need bash to stop at the first qsub and continue only after qsub finishes, i.e., when the job is done. Is there any way of doing this? It would be something similar to:

   -sync y    # in the SGE system, for instance.

what I have:

#!/bin/bash
.
.
some commands
.
.
qsub my_application  # need to wait until my_application get done
.
.
more commands
.
.
my_application_output="$(cat my_application_output.txt)"

case "$my_application_output" in
"1")
     qsub my_application
     ;;
"0")
     some commands
     ;;
"100")
     some commands
     ;;
*)
     some commands
     exit 1
     ;;
esac

.
.

some remarks


  • It is not convenient to use qsub -I -x, since I'd like to keep the output in the output file, and I do not want to lock up a node by starting an interactive session (-I).
  • I guess it is not a simple job dependency problem, since the re-submission 1) may happen, 2) may not happen, and, most importantly, if it does happen, it can happen several times.
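To make the desired control flow concrete, here is a sketch of what I am after. The function names are my own, and wait_until_job_done is a placeholder for whatever blocking mechanism Torque might offer; that placeholder is exactly the missing piece this question asks about:

```shell
#!/usr/bin/env bash
# Sketch of the desired control flow (names are placeholders).

submit_and_wait() {
    qsub my_application
    wait_until_job_done   # the missing piece: block until the job finishes
}

run_until_done() {
    submit_and_wait
    while true; do
        my_application_output="$(cat my_application_output.txt)"
        case "$my_application_output" in
            "1")   submit_and_wait ;;   # resubmit; may happen many times
            "0")   break ;;             # finished normally, script continues
            "100") break ;;
            *)     exit 1 ;;            # unexpected output
        esac
    done
}
```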

Thanks for all

Quim

3 Answers


Quim Oct 3 at 4:05: "it is not a simple job dependency problem"

You must reduce it to a simple job dependency problem, simple enough for your script to handle, anyway. And in fact your script gates on my_application_output.txt, so why not just sleep on that? Something like:

#!/usr/bin/env bash
# I prefer to have constants at the top
my_application_output_fp='/path/to/my_application_output.txt' 
#
#
# some commands
#
#
qsub my_application
#
#
# more commands
#
#

# sleep until my_application outputs
while [[ ! -r "${my_application_output_fp}" ]] ; do
    sleep 1
done

my_application_output="$(cat "${my_application_output_fp}")"
# process it

If my_application_output.txt gets written too long before the end of my_application, change my_application to write a flag file just before it exits, and gate on that:

#!/usr/bin/env bash
my_application_flag_fp='/path/to/my_application_flag.txt' 
my_application_output_fp='/path/to/my_application_output.txt' 
#
#
# some commands
#
#
qsub my_application
#
#
# more commands
#
#

# sleep until my_application writes flag
while [[ ! -r "${my_application_flag_fp}" ]] ; do
    sleep 1
done

if [[ ! -r "${my_application_output_fp}" ]] ; then
    # handle error: flag exists but output is missing
    exit 1
fi
my_application_output="$(cat "${my_application_output_fp}")"
# process it
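One caveat with this polling approach: if the job dies without ever writing the flag or output file, the loop spins forever. A small variant with a timeout guard avoids that (a sketch; the function name and the 1-hour default are my own assumptions):

```shell
# wait_for_file: sleep until the given file is readable, but give up
# after max_wait seconds (default 3600). Returns 0 on success, 1 on timeout.
wait_for_file() {
    local fp="$1"
    local max_wait="${2:-3600}"
    local waited=0
    while [[ ! -r "${fp}" ]] ; do
        (( waited >= max_wait )) && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

Call it as `wait_for_file "${my_application_flag_fp}" 7200 || exit 1` to give up after two hours instead of hanging indefinitely.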
TomRoche
  • Hi @TomRoche, thanks a lot for your solution ... In fact I had to change my script a lot. I ended up with a solution very similar to what you have written. The point here is that, as far as I know, Torque/maui cannot handle this alone, and one needs to control this via shell. – Quim Oct 15 '14 at 15:29

The qsub command returns the id of the submitted job, something like:

$qsub myapplication  
12345.hpc.host

You can then use it to check the status of your job with the qstat command,

$qstat 12345.hpc.host
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
12345.hpc.host            STDIN            user            00:00:00 Q queue

Once the job is completed, it is no longer displayed by the qstat command. In that case,

$qstat 12345.hpc.host
qstat: Unknown Job Id Error 12345.hpc.host

In fact, the output is not even necessary. One can redirect it to /dev/null and simply check the exit status of the qstat command,

if qstat 12345.hpc.host &>/dev/null; then
    echo "Job is running"
else
    echo "Job is not running"
fi

Or even shorter,

qstat 12345.hpc.host &> /dev/null && echo "Job is running" || echo "Job is NOT running"

So what you want to achieve should now be rather simple: launch the job, store its id in a variable, and sleep until the qstat command fails,

JOBID=$(qsub myapplication)
while qstat "$JOBID" &> /dev/null; do
    sleep 5
done

You can store the while loop in a bash function to use in all your processing scripts. You can also expand on this idea to launch and wait for a list of jobs to run.
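For example, a reusable function along these lines (a sketch; the function name is my own) could be sourced by every processing script:

```shell
# wait_for_job: block until the given Torque job id leaves the queue,
# polling qstat at a configurable interval (default 5 seconds).
wait_for_job() {
    local jobid="$1"
    local interval="${2:-5}"
    while qstat "$jobid" &> /dev/null; do
        sleep "$interval"
    done
}
```

With this, the main script reduces to `JOBID=$(qsub myapplication); wait_for_job "$JOBID"`, and the resubmission loop from the question can simply call it again after each qsub.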


As per the qsub docs:

-sync y causes qsub to wait for the job to complete before exiting.

John Zwinck
  • Hi @John, I am familiar with SGE, and there I can use -sync y. But as I wrote in my question, I am using Torque. So I am looking for something similar to -sync y, but in Torque. Thanks for your answer anyway. – Quim Oct 03 '14 at 11:07
  • Well, according to this document: https://wikis.nyu.edu/display/NYUHPC/Tutorial+-+Submitting+a+job+using+qsub you can use `-W depend` to make a job which depends on another one's completion. Maybe you can use that? – John Zwinck Oct 03 '14 at 14:17
  • Thanks for your comment. I think -W does not fit my problem, not without rethinking my strategy, because very often it is impossible to forecast how many times I will need to use qsub, and the -W syntax requires this. Thanks again. – Quim Oct 03 '14 at 15:42
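For reference, the `-W depend` mechanism mentioned in the comments chains a fixed pair of jobs like this (job script names are placeholders); as the comment notes, it does not help when the number of resubmissions is unknown in advance:

```shell
# chain_jobs: submit first_job.sh, then submit second_job.sh so that it
# only starts after the first job finishes successfully (afterok).
chain_jobs() {
    local first_id
    first_id=$(qsub first_job.sh)
    qsub -W depend=afterok:"$first_id" second_job.sh
}
```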