I need to run a script on many (70000) samples in parallel, and I don't want to submit them all to the queue at once. How do I schedule 100 at a time, so that each time one finishes another is queued?
My script wraps another program that writes a lot of files for each sample. I also need to extract the results from each output file into a single results file.
I thought of something like this:
# set maximum number of processes to run in SLURM
MAX_QUEUE=200
Protein_sequence='MNNAANTGTTNESNVSDAPRIEPLPSLNDDDIEKILQPNDIFTTDRTDASTTSSTAIEDIINPSLDPQSAASPVPSSSFFHDSRKPSTSTHLVRRGTPLGIYQTNLYGHNSRENTNPNSTLLSSKLLAHPPVPYGQNPDLLQHAVYRAQPSSGTTNAQPRQTTRRYQSHKSRPAFVNKLWSMLNDDSNTKLIQWAEDGKSFIVTNREEFVHQILPKYFKHSNFASFVRQLNMYGWHKVQDVKSGSIQSSSDDKWQFENENFIRGREDLLEKIIRQKGSSNNHNSPSGNGNPANGSNIPLDNAAGSNNSNNNISSSNSFFNNGHLLQGKTLRLMNEANLGDKNDVTAILGELEQIKYNQIAISKDLLRINKDNELLWQENMMARERHRTQQQALEKMFRFLTSIVPHLDPKMIMDGLGDPKVNNEKLNSANNIGLNRDNTGTIDELKSNDSFINDDRNSFTNATTNARNNMSPNNDDNSIDTASTNTTNRKKNIDENIKNNNDIINDIIFNTNLANNLSNYNSNNNAGSPIRPYKQRYLLKNRANSSTSSENPSLTPFDIESNNDRKISEIPFDDEEEEETDFRPFTSRDPNNQTSENTFDPNRFTMLSDDDLKKDSHTNDNKHNESDLFWDNVHRNIDEQDARLQNLENMVHILSPGYPNKSFNNKTSSTNTNSNMESAVNVNSPGFNLQDYLTGESNSPNSVHSVPSNGSGSTPLPMPNDNDTEHASTSVNQGENGSGLTPFLTVDDHTLNDNNTSEGSTRVSPDIKFSATENTKVSDNLPSFNDHSYSTQADTAPENAKKRFVEEIPEPAIVEIQDPTEYNDHRLPKRAKK'
# 5' primer to add at "N" terminal (left of the sequence)
p5=${Protein_sequence:463:30} # variable names are case-sensitive: must match Protein_sequence defined above
header=true # file has header and I have to skip it
# open file containing the sequence fused at the right of p5
for insert in $(awk -F, '{print $2}' "$1")
do
# if header, then continue with next iteration and flag header as false
if [ $header = true ]
then
header=false
else
printf '>%s\n%s\n' "${insert}" "${p5}${insert}" > "${insert}.fasta" # write fasta file (this is the input of psipred)
# check how many processes are in the queue
queue=$(squeue -u aerijman | wc -l)
queue=$((queue - 1)) # subtract the squeue header line
# if few processes queued, proceed, else wait.
if [ $queue -lt $MAX_QUEUE ]
then
sbatch -p campus -c 1 --job-name=${insert} --wrap="runpsipred ${insert}.fasta"
else
# take the chance to find *horiz files which contain the result
for prefix in `ls *horiz`
do
# extract the resulting sequence of 2ry structure elements and append it to a single file with all results
horiz=$(while read -r line; do if [ "${line:0:4}" == Pred ]; then echo "${line:6}" | tr -d "\n"; fi; done < "$prefix")
# label the record with the sequence this file belongs to (taken from the file name), not the current loop's insert
printf '>%s\n%s\n' "${p5}${prefix%.horiz}" "${horiz}" >> horiz.results
# rm all side files (from psipred-blast)
rm ${prefix:0:-5}*
done
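# This loop tracks whether any process has finished (so a new process can be queued)
while [ $queue -ge $MAX_QUEUE ]
do
sleep 10 # pause between polls so squeue is not called in a tight loop
queue=$(squeue -u aerijman | wc -l)
queue=$((queue - 1)) # subtract the squeue header line
done
# a slot is free now: submit the current insert, otherwise it would be skipped
sbatch -p campus -c 1 --job-name=${insert} --wrap="runpsipred ${insert}.fasta"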
fi
fi
done
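To make the extraction step easier to follow on its own, here is a minimal sketch of it as a separate function. It assumes each prediction line in a .horiz file starts with "Pred: " followed by the structure string, which is the same assumption the loop above makes. The name extract_pred is my own placeholder, not something PSIPRED provides.

```shell
#!/usr/bin/env bash

# extract_pred (hypothetical helper): join the strings that follow "Pred:"
# on every prediction line of one .horiz file into a single line.
# Assumes the layout "Pred: <structure string>" on each such line.
extract_pred() {
    awk '/^Pred:/ { printf "%s", $2 } END { print "" }' "$1"
}

# Possible use inside the collection loop (commented out on purpose):
# name=${prefix%.horiz}
# printf '>%s\n%s\n' "${p5}${name}" "$(extract_pred "$prefix")" >> horiz.results
```

Writing one complete record per printf call, with a trailing newline, keeps the headers in horiz.results on their own lines.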
I apologize for including so much irrelevant information in this script, but I believe that my amateur way of looping to watch for a vacancy in the queue could be replaced with something smarter.
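Stripped of everything else, the part I am asking about could be sketched roughly as below. This is only an illustration of the polling idea, not a tested solution on a real cluster. The function names count_jobs and wait_for_slot are hypothetical, the 30-second default interval is an arbitrary choice, and I am assuming squeue -h (no header) is available in my SLURM version.

```shell
#!/usr/bin/env bash

# count_jobs (hypothetical helper): number of this user's jobs in the queue.
# -h asks squeue to omit the header line, so no "minus 1" is needed.
count_jobs() {
    squeue -h -u "$USER" | wc -l
}

# wait_for_slot (hypothetical helper): block until fewer than $1 jobs are
# queued, polling every $2 seconds (default 30, an arbitrary value).
wait_for_slot() {
    local max=$1 interval=${2:-30}
    while [ "$(count_jobs)" -ge "$max" ]; do
        sleep "$interval"
    done
}

# Possible use inside the submission loop (commented out on purpose):
# wait_for_slot "$MAX_QUEUE"
# sbatch -p campus -c 1 --job-name="$insert" --wrap="runpsipred ${insert}.fasta"
```

Calling wait_for_slot before every sbatch means no sample is skipped when the queue is full, and the sleep keeps the loop from hammering the scheduler.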
Any help will be appreciated!