4

I am new to Stackoverflow. I am trying to automate search process using Biopython. I have two lists, one with protein GI numbers and one with corresponding nucleotide GI numbers. For example:

protein_GI=[588489721,788136950,409084506]

nucleo_GI=[588489708,788136846,409084493]

Second list was created using ELink. However, the nucleotide GIs correspond to whole genomes. I need to retrieve particular CDS from each genome which match the protein GI. I tried using again ELink with different link names ("protein_nucleotide_cds","protein_nuccore") but all I get is id numbers for whole genomes. Should I try some other link names? I also tried the following EFetch code:

import Bio
from Bio import Entrez
Entrez.email = None

handle=Entrez.efetch(db="sequences",id="588489708,588489721",rettype="fasta",retmode="text")
print(handle.read())

This method gives me nucleotide and protein sequences in fasta file but the nucleotide sequence is a whole genome.

I would be very grateful, if somebody could help me. Thanking you in advance!

anoviks
  • 43
  • 4
  • I tried your example and don't see anything wrong with it. I get a fasta with two records, each corresponding to each ID. – cnluzon Aug 18 '15 at 10:11
  • Hi @cnluzon, my problem is that the nucleotide sequence is a whole genome. I would like to extract only the coding sequence of that protein. The particular CDS can be retrieved manually from NCBI website but I cannot automate this step. – anoviks Aug 18 '15 at 16:03
  • So the first record you get is a whole genome? But it correspond to this GI ID you get. If you use the online Entrez utilities you get exactly the same record. I'm just guessing, but maybe the problem is in the previous step, where you get the GI IDs. – cnluzon Aug 19 '15 at 08:18

1 Answers1

3

I hope help you

import Bio
from Bio import Entrez
from Bio import SeqIO
Entrez.email = "mail@example.com"

gi_protein = "GI:588489721"
gi_genome = "GI:588489708"

handle=Entrez.efetch(db="sequences", id=gi_protein,rettype="fasta", retmode="text")
protein = SeqIO.parse(handle, "fasta").next()

handle=Entrez.efetch(db="sequences", id=gi_genome, rettype="gbwithparts", retmode="text")
genome = SeqIO.parse(handle, "gb").next()

#to extract feature with 'id' equal to protein
feature = [f for f in gb.features if "db_xref" in f.qualifiers and gi_protein in f.qualifiers["db_xref"]]

#to get location of CDS
start = feature[0].location.start.position
end = feature[0].location.end.position
strand = feature[0].location.strand 

seq = genome[start: end]

if strand == 1:
  print seq.seq
else:
  #if strand is -1 then to get reverse complement
  print seq.reverse_complement().seq

print protein.seq

then you get:

ATGGATTATATTGTTTCAGCACGAAAATATCGTCCCTCTACCTTTGTTTCGGTGGTAGGG
CAGCAGAACATCACCACTACATTAAAAAATGCCATTAAAGGCAGTCAACTGGCACACGCC
TATCTTTTTTGCGGACCGCGAGGTGTGGGAAAGACGACTTGTGCCCGTATCTTTGCTAAA
ACCATCAACTGTTCGAATATATCAGCTGATTTTGAAGCGTGTAATGAGTGTGAATCCTGT
AAGTCTTTTAATGAGAATCGTTCTTATAATATTCATGAACTGGATGGAGCCTCCAATAAC
TCAGTAGAGGATATCAGGAGTCTGATTGATAAAGTTCGTGTTCCACCTCAGATAGGTAGT
TATAGTGTATATATTATCGATGAGGTTCACATGTTATCGCAGGCAGCTTTTAATGCTTTT
CTTAAAACATTGGAAGAGCCACCCAAGCATGCCATCTTTATTTTGGCCACTACTGAAAAA
CATAAAATACTACCAACGATCCTGTCTCGTTGCCAGATTTACGATTTTAATAGGATTACC
ATTGAAGATGCGGTAGGTCATTTAAAATATGTAGCAGAGAGTGAGCATATAACTGTGGAA
GAAGAGGGGTTAACCGTCATTGCACAAAAAGCTGATGGAGCTATGCGGGATGCACTTTCC
ATCTTTGATCAGATTGTGGCTTTCTCAGGTAAAAGTATCAGCTATCAGCAAGTAATCGAT
AATTTGAATGTATTGGATTATGATTTTTACTTTAGGTTGGTGGATGCTTTTCTGGCAGAA
GATACTACACAAACACTATTGATTTTTGATGAGATATTGAAACGGGGATTTGATGCACAT
CATTTTATTTCCGGTTTAAGTTCTCATTTGCGTGATTTACTTGTATGTAAGGATGCAGCC
ACCATTCAGTTGCTGGATGTGGGTGCTAAAATTAAGGAGAAGTACGGTGTTCAGGCGCAA
AAAAGTACGATTGACTTTTTAATGGATGCTTTAAATATTACCAACGATTGCGATTTGCAA
TATAGGGTGGCTAAAAATAAGCGTTTGCATGTGGAGTTTGCTCTTCTTAAGATAGCACGT
GTATTAGATGAACAAAGAAAAAAGTAG

MDYIVSARKYRPSTFVSVVGQQNITTTLKNAIKGSQLAHAYLFCGPRGVGKTTCARIFAK
TINCSNISADFEACNECESCKSFNENRSYNIHELDGASNNSVEDIRSLIDKVRVPPQIGS
YSVYIIDEVHMLSQAAFNAFLKTLEEPPKHAIFILATTEKHKILPTILSRCQIYDFNRIT
IEDAVGHLKYVAESEHITVEEEGLTVIAQKADGAMRDALSIFDQIVAFSGKSISYQQVID
NLNVLDYDFYFRLVDAFLAEDTTQTLLIFDEILKRGFDAHHFISGLSSHLRDLLVCKDAA
TIQLLDVGAKIKEKYGVQAQKSTIDFLMDALNITNDCDLQYRVAKNKRLHVEFALLKIAR
VLDEQRKK
Jose Ricardo Bustos M.
  • 8,016
  • 6
  • 40
  • 62
  • 2
    This is brilliant! Thanks a lot Jose! I work with Python 3 so I had to change few things but the overall code is very good. I just started with Biopython and I am still learning. I hope one day I will be able to help others like you helped me :) – anoviks Aug 19 '15 at 12:10