
I have a FASTA file from which I am using awk to extract the fields I need (sequences with their headers). I then pipe the result to a BLAST program and finally to qsub in order to submit a job. The file:

>sequence_1
ACTGACTGACTGACTG
>sequence_2
ACTGGTCAGTCAGTAA
>sequence_3
CCGTTGAGTAGAAGAA

and the command (which works):

awk < fasta.fasta '/^>/ { print $0 } $0 !~ /^>/' | echo "/Local/ncbi-blast-2.2.25+/bin/blastx -db blastdb.fa -outfmt 5 >> /User/blastresult.xml" | qsub -q S

What I would like to do is add a condition that checks the number of jobs I am currently running (using qstat); if it is below a certain threshold, the job will be submitted. For example:

allowed_jobs=200 #for example 
awk < fasta.fasta '/^>/ { print $0 } $0 !~ /^>/' | echo "/Local/ncbi-blast-2.2.25+/bin/blastx -db blastdb.fa -outfmt 5 >> /User/blastresult.xml" | cmd=$(qstat -u User | grep -c ".") | if [ $cmd -lt $allowed_jobs ]; then  qsub -q S

Unfortunately (for me, anyway) I have failed in all my attempts to do that. I'd be grateful for any help.

EDIT: to elaborate a bit, what I am trying to do is extract from the fasta file entries like this:

>sequence_x
ACTATATATATA

or basically `>HEADER\nSEQUENCE`, one entry at a time, and pipe each to the BLAST program, which can read from stdin. I want to create a unique job for each sequence, and this is why I want to pipe to qsub once per sequence. To put it plainly, each qsub submission would have looked something like this:

qsub -q S /Local/ncbi-blast-2.2.25+/bin/blastx -db blastdb.fa -query FASTA_SEQUENCE -outfmt 5 >> /User/blastresult.xml

Note that the -query flag is unnecessary when the sequence is piped in on stdin. However, the main problem for me is how to incorporate the condition I mentioned above, so that a sequence is piped to qsub only if the qstat result is below the threshold. Ideally, if the qstat result is above the threshold, it would sleep until it goes below and then pass the sequence forward.

thanks.
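For reference, the "submit only when below the threshold, otherwise sleep" piece described above can be sketched on its own like this. This is only a sketch: it assumes an SGE-style `qstat` whose job lines begin with a numeric job id, and the 30-second poll interval is arbitrary; both may need adjusting for a different scheduler.

```shell
#!/bin/bash
allowed_jobs=200

# count_jobs prints the number of jobs the current user has in the queue.
# Assumes qstat job lines start with a numeric job id.
count_jobs() {
    qstat -u "$USER" 2>/dev/null | grep -c '^[0-9]'
}

# wait_for_slot blocks until the job count drops below allowed_jobs.
wait_for_slot() {
    while [ "$(count_jobs)" -ge "$allowed_jobs" ]; do
        sleep 30   # arbitrary poll interval
    done
}
```

A submission loop would then call `wait_for_slot` right before each `qsub`.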

  • If `qsub` runs in the background, can't you just count the number of processes with `ps ax | grep -wc '[q]sub'`? Or, better than just not running jobs, why not submit them to a queue, then write a queue runner that spawns `qsub` on the first 200 jobs, then spawns new instances as the old ones finish? That would be an entirely different question, of course. – ghoti Nov 13 '12 at 14:10
  • Your first awk call does not do any filtering: it prints the whole file, the same way cat does. In fact, the output of your awk command is not used at all: echo does not read from standard input (the pipe in this case). Perhaps you could tell us what you are trying to accomplish: what the input looks like and what the output should be. – Hai Vu Nov 13 '12 at 15:57
  • @HaiVu I have added some more information about my problem above. thanks. – Schrodinger's Cat Nov 13 '12 at 18:59

2 Answers


Hello, I guess this has long since been answered.

I'll just provide a way to solve part of this, by counting the sequences to be processed before handing anything to awk; the awk piece would go where `echo time to work` is.

#!/bin/bash
# Count the sequences (header lines) in the file.
ct=$(grep -c '^>' fasta.fasta)
if [ "$ct" -lt 201 ]; then
    echo time to work
else
    echo too much
fi
McUsr
  • Thanks for your attempt. However, this is not my problem: I am looking for a way to send sequences one by one to BLAST with qsub. Your answer does not address that. – Schrodinger's Cat Dec 25 '12 at 12:03

This bit of shell reads two lines at a time (a header and its sequence), prints them to stdout, and pipes them into your qsub command:

while IFS= read -r header; do
    IFS= read -r sequence
    printf "%s\n" "$header" "$sequence" | 
    qsub -q S /Local/ncbi-blast-2.2.25+/bin/blastx -db blastdb.fa -outfmt 5 >> /User/blastresult.xml
done < fasta.fasta
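To fold the threshold condition from the question into this loop, one option is to wrap the submission in a function that waits for a free slot before each `qsub`. This is a sketch, not tested against a real scheduler: the `qstat` line format (job lines assumed to begin with a numeric id), the 30-second poll interval, and the 200-job limit are all assumptions you may need to adjust.

```shell
#!/bin/bash
allowed_jobs=200                  # assumed threshold
result=/User/blastresult.xml      # collected output, as in the question

# running_jobs prints how many jobs the current user has queued or running.
running_jobs() {
    qstat -u "$USER" 2>/dev/null | grep -c '^[0-9]'
}

# submit_all reads a FASTA file two lines at a time (header + sequence) and
# submits one BLAST job per sequence, sleeping while the queue is full.
submit_all() {
    while IFS= read -r header; do
        IFS= read -r sequence
        while [ "$(running_jobs)" -ge "$allowed_jobs" ]; do
            sleep 30              # arbitrary poll interval
        done
        printf '%s\n%s\n' "$header" "$sequence" |
            qsub -q S /Local/ncbi-blast-2.2.25+/bin/blastx \
                -db blastdb.fa -outfmt 5 >> "$result"
    done < "$1"
}
```

It would be invoked as `submit_all fasta.fasta`.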
glenn jackman
  • Thanks for that. My problem, however, is adding a condition to the pipe that checks the number of jobs the user is running and only passes the command to qsub if it is below the threshold. – Schrodinger's Cat Nov 13 '12 at 21:47