
I would like to retrieve protein FASTA sequences from Entrez with Python 2.7. I am looking for any proteins that have the keywords "terminase" and "large" in their name. So far I have this code:

from Bio import Entrez
Entrez.email = "example@example.org"


searchResultHandle = Entrez.esearch(db="protein", term="terminase large", retmax=1000)
searchResult = Entrez.read(searchResultHandle)
ids = searchResult["IdList"]

handle = Entrez.efetch(db="protein", id=ids, rettype="fasta", retmode="text")
record = handle.read()

# Use a context manager so the output file is flushed and closed
with open('myfasta.fasta', 'w') as out_handle:
    out_handle.write(record.rstrip('\n'))

However, this returns terminases from many different organisms, whereas I need only terminases from bacteriophages (specifically Viruses [taxid 10239] with bacterial hosts). I have obtained the nuccore accession IDs of the viruses I am interested in from NCBI, but I don't know how to combine these two pieces of information. The ID file looks like this:

NC_001341
NC_001447
NC_028834
NC_023556
...

Do I need to access every gb file of every ID and search for my desired protein in it?

tahunami

1 Answer


I found what I was looking for. Instead of:

searchResultHandle = Entrez.esearch(db="protein", term="terminase large", retmax=1000)

I now use a more specific search term:

searchterm = "(terminase large subunit AND viruses[Organism]) AND Caudovirales AND refseq[Filter]"
searchResultHandle = Entrez.esearch(db="protein", term=searchterm, retmax=6000)

which narrowed my search down to the desired viruses. Granted, it filters by taxonomic group rather than by host, but that is enough for my work.
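For anyone who wants the whole pipeline in one place, here is a minimal sketch that combines the refined query with the fetch step from the question. The helper names (`build_term`, `batches`, `fetch_fasta`) and the batch size of 200 are my own assumptions for illustration, not part of Biopython; only `Entrez.esearch`, `Entrez.efetch` and `Entrez.read` are actual Biopython calls. I have not benchmarked the batching, so treat it as a starting point.

```python
def build_term(keywords, organism="viruses", taxon="Caudovirales", refseq_only=True):
    """Assemble an Entrez query string in the same shape as the one above.

    All parameter names and defaults are hypothetical conveniences.
    """
    term = "({0} AND {1}[Organism])".format(keywords, organism)
    if taxon:
        term += " AND {0}".format(taxon)
    if refseq_only:
        term += " AND refseq[Filter]"
    return term


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_fasta(term, out_path, email, retmax=6000, batch_size=200):
    """Search the protein database and write the matching FASTA records.

    Returns the number of IDs found. Requires Biopython and network access.
    """
    # Imported here so the two helpers above work without Biopython installed.
    from Bio import Entrez

    Entrez.email = email
    search_handle = Entrez.esearch(db="protein", term=term, retmax=retmax)
    ids = Entrez.read(search_handle)["IdList"]
    search_handle.close()

    with open(out_path, "w") as out_handle:
        # Fetch in chunks (assumed size) rather than one very large request.
        for chunk in batches(ids, batch_size):
            fetch_handle = Entrez.efetch(db="protein", id=",".join(chunk),
                                         rettype="fasta", retmode="text")
            out_handle.write(fetch_handle.read())
            fetch_handle.close()
    return len(ids)
```

Calling `fetch_fasta(build_term("terminase large subunit"), "myfasta.fasta", "example@example.org")` should then run the same search as above and write the results to `myfasta.fasta`.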

Thank you @Llopis for additional help

tahunami