I am trying to run a few dozen sequences through BLAST, using Bio.Blast with NCBIWWW, in Python 2.7. One or two sequences work fine, but NCBIWWW.qblast() always stops after about 5-7 consecutive BLAST searches. Importantly, the program does not crash or exit with an error; it just stalls and freezes forever, and I have to kill the application manually. It does not appear to be an Internet connection problem either, since no errors are raised that would suggest one.

I have no idea what is wrong. Is there a mistake in my code that prevents multiple BLAST searches, or are there alternative algorithms for this purpose?

My code:

    from Bio.Blast import NCBIWWW
    import urllib

    def load_uniprot_fasta(identifier):
        # Downloads the FASTA record for a UniProt identifier and saves it as <identifier>.seq
        link = "http://www.uniprot.org/uniprot/" + identifier + ".fasta"

        f = urllib.urlopen(link)
        content = f.read()
        f.close()
        print content
        print "\n"
        with open(identifier + ".seq", "w") as new_file:
            new_file.write(content)


    evalue = 0.00001  # defined here but not passed to qblast below

    # list.list holds one UniProt identifier per line
    with open("list.list", "r") as id_list:
        for line in id_list:
            uniprot_id = line.strip()
            load_uniprot_fasta(uniprot_id)  # creates <uniprot_id>.seq
            with open(uniprot_id + ".seq") as seq_file:
                fasta_object = seq_file.read()
            result_handle = NCBIWWW.qblast("blastp", "swissprot", fasta_object)
            print "SUCCESS\n"
DrOrpheum
  • Try closing all of your files after you have written to them. No idea if that's the issue, but they should be closed anyway. Or, better, use a `with` statement (a context manager). – roganjosh Nov 13 '17 at 02:21
  • @roganjosh Thanks for the tip. I just tried it and unfortunately it doesn't fix the problem. – DrOrpheum Nov 13 '17 at 02:43
  • Can you provide the file that is causing the issue? Try printing `fasta_object` before calling `result_handle` to locate it. – rodgdor Nov 13 '17 at 13:45
  • @rodgdor I can print fasta_object and it looks just fine. Definitely not a problem with that. As to the file - any file formatted correctly causes this problem. As an example, you can try this: https://www.dropbox.com/s/loaogorfc3sz6qg/list.list?dl=0 – DrOrpheum Nov 13 '17 at 14:21
  • Okay but can you tell me what is the `uniprot_id` that is causing python to freeze? – rodgdor Nov 13 '17 at 15:18
  • @DrOrpheum The error must come from an infinite loop in the [source code](https://github.com/biopython/biopython/blob/master/Bio/Blast/NCBIWWW.py) of the function. – rodgdor Nov 13 '17 at 15:29
  • @rodgdor Any uniprot_id can cause it to freeze. In that list of 20-or-so uniprot_ids, sometimes the program will freeze on the 1st, sometimes the 2nd, sometimes the 8th. I haven't seen it go past 9 sequences. – DrOrpheum Nov 13 '17 at 17:41
  • @DrOrpheum That is very odd. I think it's an issue with the code; take a look at the `while True` loop in the source code. Maybe you should raise the issue on GitHub? – rodgdor Nov 13 '17 at 19:27

1 Answer

I don't see any intentional delay in your code. Have you read the NCBI BLAST Usage Guidelines? They say:

The NCBI BLAST servers are a shared resource. We give priority to interactive users. ... To avoid problems, API users should comply with the following guidelines:

  • Do not contact the server more often than once every 10 seconds.
  • Do not poll for any single RID more often than once a minute.
  • Use the URL parameter email and tool, so that the NCBI can contact you if there is a problem.
  • Run scripts weekends or between 9 pm and 5 am Eastern time on weekdays if more than 50 searches will be submitted.
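The first guideline can be enforced in a script with a small helper that spaces out calls. This is only a minimal sketch and not part of Biopython: `Throttle` and its parameters are hypothetical names, and the 10-second default is taken from the guideline above.

```python
import time


class Throttle(object):
    """Blocks so that successive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval=10.0, clock=time.time, sleep=time.sleep):
        # clock and sleep are injectable so the helper can be tested without real delays
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call; None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the previous call: sleep off the difference
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Create one `Throttle()` before the loop and call its `wait()` method immediately before each `NCBIWWW.qblast(...)` call. This only keeps the script within the stated rate limit; whether it cures a stall is a separate question.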

Although NCBIWWW has a delay mechanism for how often it checks for results from a query, I don't see that it adds delays between queries. I'm not saying this is definitely your issue, but you could be outside the NCBI guidelines. Another piece of advice I've seen with respect to this issue:

DO NOT submit searches that contain only single sequence! You need to batch the query and submit a set in a single search request.
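One way to read "batch the query" is to concatenate several FASTA records into a single multi-record FASTA string and submit that as one search. The sketch below only builds the batches; `batch_fasta` and the `batch_size` default of 5 are hypothetical, and it assumes that `NCBIWWW.qblast` accepts a multi-record FASTA string as its sequence argument.

```python
def batch_fasta(records, batch_size=5):
    """Yield multi-FASTA strings, each holding up to batch_size records.

    records: iterable of single-record FASTA strings (header line plus sequence).
    """
    batch = []
    for record in records:
        record = record.strip()
        if not record:
            continue  # skip empty downloads or blank entries
        batch.append(record)
        if len(batch) == batch_size:
            yield "\n".join(batch) + "\n"
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch
        yield "\n".join(batch) + "\n"
```

Each yielded string would then take the place of `fasta_object` in `NCBIWWW.qblast("blastp", "swissprot", ...)`. As far as I know, each query in such a batch is still searched against the database on its own rather than aligned against the other queries, and a result handle holding several queries would be read with `NCBIXML.parse` instead of `NCBIXML.read`.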

cdlane
  • Thank you for your suggestions. I've tried it before and unfortunately, adding a 10-second delay doesn't fix the issue. I've tried running the script at different times of night and day and that doesn't change anything either. As to your last tip, I don't really understand it. If I submit more than one sequence at a time, won't the result just be a multiple sequence alignment between the submitted sequences? If I am wrong, how do I "batch the query" and submit a whole set in one go? – DrOrpheum Nov 14 '17 at 13:31