Using Portable Batch System (PBS) Arrays To Work On Different Files Concurrently

Question

I am trying to use PBS arrays to submit 5 jobs in parallel, each running the same program on a different file. PBS will start five copies of the script, each with a different integer in the PBS_ARRAYID variable. The script would be run with: qsub script.pbs

My current code is below. It works as-is, but every job computes the list of files more than once. Is there a more efficient way to do this?

#PBS -S /bin/bash
#PBS -t 1-5       #Makes the $PBS_ARRAYID have the integer values 1-5
#PBS -V

workdir="/user/test"

samtools sort `find ${workdir}/*.bam | sed ${PBS_ARRAYID}'!d'` > `find ${workdir}/*.bam | sed ${PBS_ARRAYID}'!d' | sed "s/.bam/.sorted.bam/"`

Thanks for your comment. I tried to be more explicit in my question and replaced ls with find (as suggested by the shellcheck tool). – gdeniz, Nov 09 '17 at 01:10
The reason I do not want to use a for loop here is that I run this script on a server, and the PBS array lets me run each $PBS_ARRAYID as a separate job on a separate CPU. So $PBS_ARRAYID takes the integers 1 to 5 in this case, and I can use those to process 5 different .txt files. The program runs correctly as shown above. I would just like to learn how to be more efficient in shell. Thanks! – gdeniz, Nov 09 '17 at 01:49

Charles Duffy · Accepted Answer · 2017-11-09T21:13:59.560

#PBS -S /bin/bash
#PBS -t 0-4       #Makes the $PBS_ARRAYID have the integer values 0-4
#PBS -V

workdir="/user/test"

files=( "$workdir"/*.bam )       # Expand the glob, store it in an array
infile="${files[$PBS_ARRAYID]}"  # Pick one item from that array

exec samtools sort "$infile" >"${infile%.bam}.sorted.bam"
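
The script above trusts that every value of PBS_ARRAYID has a matching file. If the directory holds fewer files than the -t range covers, or no .bam files at all, infile ends up empty or set to the literal pattern. A minimal sketch of a guard follows, assuming bash. The function name pick_input and the temporary demo directory are hypothetical and not part of the answer.

```shell
#!/bin/bash
# Hypothetical helper (not from the answer): set the global variable `infile`
# to the *.bam file at zero-based position $2 in directory $1, or return 1.
pick_input() {
  local dir=$1 idx=$2
  local -a files
  shopt -s nullglob              # an unmatched glob expands to nothing
  files=( "$dir"/*.bam )
  shopt -u nullglob
  # Reject anything that is not a plain non-negative integer within range.
  if [[ ! $idx =~ ^[0-9]+$ ]] || (( idx >= ${#files[@]} )); then
    echo "no input file for array index '$idx'" >&2
    return 1
  fi
  infile=${files[idx]}
}

# Demo with empty stand-in files instead of real BAM data.
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT
touch "$demo_dir"/a.bam "$demo_dir"/b.bam "$demo_dir"/c.bam

pick_input "$demo_dir" 1 && echo "index 1 -> ${infile##*/}"   # index 1 -> b.bam
pick_input "$demo_dir" 7 || echo "index 7 rejected"
```

In a job script, the call would presumably be pick_input "$workdir" "$PBS_ARRAYID" || exit 1, placed just before the exec line.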

Note:

files=( "$workdir"/*.bam ) performs a glob internal to bash (no ls needed) and stores the results of that glob in an array for reuse.
Arrays are zero-indexed; thus, we're using 0-4 instead of 1-5.
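Command substitutions -- `...` or $(...) -- fork a subshell each time, on top of whatever external commands run inside them. That overhead is significant, so they are best avoided where a built-in expansion does the same job.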
Using exec for the last command in the script tells the shell interpreter it can replace itself with that command, rather than needing to remain in memory.
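
To make the point about command substitution concrete, here is a small sketch comparing a sed pipeline with the ${infile%.bam} expansion used in the script. The file names are hypothetical and chosen only for illustration. The last case shows a side effect of the unanchored pattern s/.bam/.sorted.bam/: the dot matches any character, so the first "bam" preceded by any character is rewritten, not necessarily the extension.

```shell
#!/bin/bash
infile="/user/test/sample1.bam"    # hypothetical file name

# Pipeline version: forks a subshell and starts a sed process.
# The dot is escaped and the pattern anchored so only a trailing ".bam" changes.
via_sed=$(printf '%s\n' "$infile" | sed 's/\.bam$/.sorted.bam/')

# Parameter expansion: ${var%pattern} strips the shortest matching suffix
# inside the shell itself, with no extra process.
via_expansion="${infile%.bam}.sorted.bam"

echo "$via_sed"          # /user/test/sample1.sorted.bam
echo "$via_expansion"    # /user/test/sample1.sorted.bam

# Unanchored, unescaped pattern applied to a name that contains "bam" twice.
tricky="/user/test/mybam1.bam"     # hypothetical file name
loose=$(printf '%s\n' "$tricky" | sed 's/.bam/.sorted.bam/')
echo "$loose"                      # /user/test/m.sorted.bam1.bam
echo "${tricky%.bam}.sorted.bam"   # /user/test/mybam1.sorted.bam
```

The expansion only ever touches the end of the string, which is one more reason to prefer it here besides speed.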

edited Nov 09 '17 at 21:13

answered Nov 09 '17 at 16:36

Very elegant, thank you so much. Will try to work on understanding now more about the theory of what you said! Unfortunately my reputation does not allow me to give you a positive vote, but thanks again! – gdeniz Nov 09 '17 at 20:59
If this solves the problem, you should be able to mark the question as resolved by clicking the checkbox next to the answer. – Charles Duffy Nov 09 '17 at 21:13
(And frankly, I think you've edited this to a sufficient level of quality that if further questions are up to the same standard, you won't stay low-reputation for long). – Charles Duffy Nov 09 '17 at 21:19
