0

I am creating a code that takes accession numbers from an existing excel file and ultimately running it through BLAST using Entrez. I am very new to Python and pretty much just learning as a I write. Here is what my code looks like right now, it currently will run and run but never complete and never return any errors. Any help would be greatly appreciated!

import pandas
from Bio import SeqIO
from Bio import Entrez
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML 

genesexcel=pandas.read_excel(r'C:\Users\bmn45\OneDrive - Drexel University\Monoterpenoid Indole Alkaloids\MIA_genes.xlsb', sheet_name='Data S16')  
#import the excel into variable

acc_num=genesexcel['Accession number'].values
#store values from first column of excel
print(acc_num)

for number in acc_num:

    handle = Entrez.esearch(db='protein', term=number, retmax="20")
    #searching the Genbank database using the accession number

    Entrez_result = Entrez.read(handle)
  
    ID_List = Entrez_result['IdList']
    #storing the ID number(s) from the Entrez search of the accession number

    handle2 = Entrez.efetch(db='protein', id=ID_List, rettype='gb')
    #uses efecth to get the full sequence record/file using the searched ID number
    #currently, the ID value is hardcoded, database is protein and the return file type (rettype) is set as a genbank file(gb)

    seq_record = SeqIO.read(handle2, "genbank")
    #storing the full sequence record as a genbank file
    seq_ID=(seq_record.name) 
    #storing the name of the sequence record (the full sequence ID)

    result_handle = NCBIWWW.qblast("tblastn", "nt", seq_ID, entrez_query="txid4056[ORGN]")
    #searching the nucleotide database ("nt") from protein sequence using tblastn, refined search related to Apocynacae (taxID4056)

    with open("my_blast_result.xml", "w") as out_handle:
        out_handle.write(result_handle.read())
    result_handle.close()
    #saving the blast results into a file

    result_handle = open("my_blast_result.xml")
    #loading saved BLAST results

    blast_records = NCBIXML.parse(result_handle)
    #parsing (analying) through the returned hits

    print("Alignments for sequence", next(blast_records).query)
    for alignment in blast_records:
        print("Accession number:", alignment.accession)
        print("Sequence:", alignment.title)
        print("Length:", alignment.length)
        print()
  
    print(result_handle.read())

I have tried running the code with a hard-coded accession number and ID (without looping through) and it worked just fine so I know my problem lies with how my loop is set up.

53RT
  • 649
  • 3
  • 20
  • Hello and welcome to StackOverflow. Thank you for your question. I see you have `print` statements inside the loops -- that could help us. Is it printing as expected, or you don't see any console output at all? – Matej May 30 '23 at 19:20
  • Maybe you can explain what the `print(acc_num)` shows this would be my first guess causing problems. If the content of your loop works fine with hardcoded numbers you could uncomment parts of the code inside of you loop and check if the loop correctly iterates over you numbers. Maybe you can simplify your example code by deleting unnecessary parts. – 53RT May 30 '23 at 19:21
  • 3
    Remove your personal email from code, just write there a fake one or says you need an email there, crawler could intercept it and flood your inbox with spam – pippo1980 May 31 '23 at 05:40
  • Please provide a fewlines of your input and expected output. See here for info https://stackoverflow.com/help/minimal-reproducible-example – pippo1980 May 31 '23 at 05:44
  • @Matej The loop continues to run forever and any print() command that is inside the loop is never outputted – Brianna Nissley Jun 05 '23 at 14:31
  • 1
    Thank you @53RT, I believe my code should be working now after changing my acc_num variable to a list! It is running and giving me outputs as it goes now. – Brianna Nissley Jun 06 '23 at 16:21

1 Answers1

0

I am currently experiencing the same kind of issue: my blast searches take on average 3 minutes to return a result but sometimes, after 12 hours, it is still searching. restarting the exact same blast ( or doing it on the website) may just take a few minutes. My advice would be :

  • to put some timestamp at each step so you know if your program is still moving forward.
  • save your result each time your have done with your result so when blast send you some error, you don't have to restart everything.