
I need to run a script on many (70,000) samples in parallel, and I don't want to submit them all to the queue at once. How do I schedule 100 at a time, so that each time one finishes another gets queued?

A lot of files are written as a result of running another program that my script wraps. I also need to extract the results from each file into a single results file.

I came up with something like this:

# set maximum number of processes to run in SLURM
MAX_QUEUE=200

Protein_sequence='MNNAANTGTTNESNVSDAPRIEPLPSLNDDDIEKILQPNDIFTTDRTDASTTSSTAIEDIINPSLDPQSAASPVPSSSFFHDSRKPSTSTHLVRRGTPLGIYQTNLYGHNSRENTNPNSTLLSSKLLAHPPVPYGQNPDLLQHAVYRAQPSSGTTNAQPRQTTRRYQSHKSRPAFVNKLWSMLNDDSNTKLIQWAEDGKSFIVTNREEFVHQILPKYFKHSNFASFVRQLNMYGWHKVQDVKSGSIQSSSDDKWQFENENFIRGREDLLEKIIRQKGSSNNHNSPSGNGNPANGSNIPLDNAAGSNNSNNNISSSNSFFNNGHLLQGKTLRLMNEANLGDKNDVTAILGELEQIKYNQIAISKDLLRINKDNELLWQENMMARERHRTQQQALEKMFRFLTSIVPHLDPKMIMDGLGDPKVNNEKLNSANNIGLNRDNTGTIDELKSNDSFINDDRNSFTNATTNARNNMSPNNDDNSIDTASTNTTNRKKNIDENIKNNNDIINDIIFNTNLANNLSNYNSNNNAGSPIRPYKQRYLLKNRANSSTSSENPSLTPFDIESNNDRKISEIPFDDEEEEETDFRPFTSRDPNNQTSENTFDPNRFTMLSDDDLKKDSHTNDNKHNESDLFWDNVHRNIDEQDARLQNLENMVHILSPGYPNKSFNNKTSSTNTNSNMESAVNVNSPGFNLQDYLTGESNSPNSVHSVPSNGSGSTPLPMPNDNDTEHASTSVNQGENGSGLTPFLTVDDHTLNDNNTSEGSTRVSPDIKFSATENTKVSDNLPSFNDHSYSTQADTAPENAKKRFVEEIPEPAIVEIQDPTEYNDHRLPKRAKK'

# 5' primer to add at "N" terminal (left of the sequence)
p5=${Protein_sequence:463:30}

header=true # file has header and I have to skip it

# open file containing the sequence fused at the right of p5
for insert in $(awk -F, '{print $2}' "$1")
do
    # if header, then continue with next iteration and flag header as false
    if [ "$header" = true ]
    then
        header=false
    else
        printf '>%s\n%s\n' "${insert}" "${p5}${insert}" > "${insert}.fasta" # write fasta file (this is the input of psipred)

        # check how many processes are in the queue (subtract 1 for squeue's header line)
        queue=$(squeue -u aerijman | wc -l)
        queue=$((queue - 1))

        # if few processes queued, proceed, else wait.
        if [ $queue -lt $MAX_QUEUE ]
        then
            sbatch -p campus -c 1 --job-name=${insert} --wrap="runpsipred ${insert}.fasta"
        else
            # take the chance to harvest *horiz files, which contain the results
            for prefix in *horiz
            do
                [ -e "$prefix" ] || continue  # skip if no horiz files exist yet
                # recover the insert name from the file name (files are named <insert>.horiz)
                name=${prefix%.horiz}
                # extract the resulting string of secondary structure elements and append it to a single file with all results
                horiz=$(while read line; do if [ "${line:0:4}" == Pred ]; then echo "${line:6}" | tr -d "\n"; fi; done < "$prefix")
                printf ">%s\n%s\n" "${p5}${name}" "${horiz}" >> horiz.results
                # rm all side files (from psipred-blast)
                rm "${name}".*
            done

            # This loop watches the queue until a job has finished (so a new process can be queued)
            while [ $queue -ge $MAX_QUEUE ]
            do
                sleep 30  # don't hammer the scheduler with squeue calls
                queue=$(squeue -u aerijman | wc -l)
                queue=$((queue - 1))
            done
        fi
    fi
done
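
As an aside on the harvesting step, the inner while-read loop can be collapsed into a single awk call that concatenates everything after the "Pred: " tag. This is just a sketch; sample.horiz and its contents are made up here for illustration, since I don't have a real psipred output at hand:

```shell
# Fabricate a minimal *.horiz-style file (made-up content, for illustration only)
cat > sample.horiz <<'EOF'
Conf: 988877
Pred: CCHHHH
  AA: MNNAAN
Conf: 776655
Pred: HHEECC
  AA: TGTTNE
EOF

# Concatenate the part after "Pred: " on every Pred line into one string
horiz=$(awk '/^Pred:/ { printf "%s", substr($0, 7) }' sample.horiz)
echo "$horiz"
```

This avoids spawning a subshell plus `tr` per line and reads the file once.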

I apologize for including too much irrelevant information in this script, but I believe that my amateur way of making a loop watching for a vacancy in the queue can be changed for something smarter.
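
For reference, the watching loop can be avoided entirely with a SLURM job array using a concurrency limit, as the comments below suggest: `--array=1-70000%100` keeps at most 100 tasks active while the scheduler itself feeds in the rest. A sketch of such a batch script, assuming the inserts live in a hypothetical inserts.csv with a header row (p5 would have to be re-derived or exported, as in the original script):

```shell
#!/bin/bash
#SBATCH -p campus
#SBATCH -c 1
#SBATCH --job-name=psipred
#SBATCH --array=1-70000%100   # 70000 tasks, at most 100 active at any time

# Task N reads data line N+1 of inserts.csv (skipping the header)
insert=$(awk -F, -v n=$((SLURM_ARRAY_TASK_ID + 1)) 'NR == n {print $2}' inserts.csv)

# p5 must be defined here, e.g. sliced from Protein_sequence as in the original
printf '>%s\n%s\n' "$insert" "${p5}${insert}" > "${insert}.fasta"
runpsipred "${insert}.fasta"
```

Submitted once with `sbatch array.sh`, this replaces the whole submit-and-poll loop; harvesting the *horiz files can then be a single pass after the array finishes.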

Any help will be appreciated!

  • @Cyrus I'm sorry to say that I cannot see the answer in the supposed duplicate question. `xargs` and `SLURM` are two completely different beasts! – Poshi Nov 14 '18 at 21:55
  • Dear @Cyrus, I asked the question because the only way I found to do what I am trying to do is extremely long and complicated (and I am still trying to make it work). You are seriously discouraging me from asking on this platform with your comments, downvote, and marking the question as a duplicate when I didn't know what xargs is. You could have just pointed me to that answer (which doesn't yet provide me a solution). – aerijman Nov 14 '18 at 23:38
  • I didn't vote down. I reopened the question. Best regards, Cyrus – Cyrus Nov 15 '18 at 19:46
  • If you need help implementing the job array, don't hesitate to ask! – Poshi Nov 16 '18 at 23:09
  • Thanks Cyrus and @Poshi, --array=1:100%25 helps, but I also need to keep track of finished processes through their job-names. Is there any better way to do that than with my while loop? – aerijman Nov 18 '18 at 02:17
  • 1
    You can have different job names for every job in the array. You can use `sbatch --job-name` to give proper names like `PSIPRED-%A_%a`. `%a` is substituted for the job array index. You should know that the first job ran with the first input and so on. When the jobs finish, you should be able to retrieve the results and differentiate them easily through this association. – Poshi Nov 18 '18 at 10:43
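
Following Poshi's point about the array index, the association between task and input is just "task N handles data line N". A minimal sketch of that mapping, with SLURM_ARRAY_TASK_ID simulated by hand (inside a real array job SLURM sets it for you; inserts.csv is again a hypothetical input file):

```shell
# Simulate the index SLURM would provide inside a real array job
SLURM_ARRAY_TASK_ID=2

# Hypothetical input: a header line plus one insert per data line
printf 'id,insert\n1,AAAA\n2,CCCC\n3,GGGG\n' > inserts.csv

# Task N processes data line N, i.e. file line N+1 because of the header
insert=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" inserts.csv | cut -d, -f2)
echo "$insert"
```

With this mapping there is no need to track finished jobs by name at all: each task knows its own input, and each output file can carry the same index.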
