Biopython Retrieving protein transcripts for a protein coding gene

Question

I am using Biopython's wrapper for the NCBI E-utilities to retrieve related proteins, identical proteins, and variant proteins (transcripts, splice variants, etc.) for a given protein-coding gene.

This information is shown on the gene's NCBI page under the "mRNA and Protein(s)" section.

I am retrieving identical proteins via LinkName=protein_protein_identical and related via LinkName=protein_protein.

Example call

Is there a way to retrieve the transcripts for a protein coding gene?

score 0 · Accepted Answer · answered Jul 14 '14 at 08:31

It's easy but annoying (XML craziness involved). First you retrieve your record from Entrez:

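from Bio import Entrez

Entrez.email = "you@example.com"  # placeholder: NCBI asks for a contact address
handle = Entrez.efetch(db="gene",
                       id="10555",
                       retmode="xml")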

Now handle is a file-like object over the XML response. You can parse it with Entrez.parse() from Biopython, but I find the XML too deeply nested to navigate comfortably. The accessions you want sit at this path:

<Entrezgene_comments>
 <Gene-commentary>
  <Gene-commentary_comment>
   <Gene-commentary>
    <Gene-commentary_products>
     <Gene-commentary>
      <Gene-commentary_type value="mRNA">
       <Gene-commentary_products>
        <Gene-commentary>
         <Gene-commentary_type value="peptide">
          <Gene-commentary_accession>NP_001012745</Gene-commentary_accession>

After parsing with Entrez.parse() you'll have nested dicts and lists to dig through until you reach the accession ID. Once you have it, you can request the sequence from Entrez with:

handle = Entrez.efetch(db="protein",
                       id="NP_001012745",
                       rettype="fasta",
                       retmode="text")
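For the digging step described above, here is a minimal sketch of a recursive walk that collects accessions without hard-coding the full path. It assumes the parsed record behaves like ordinary nested dicts and lists (Biopython's parsed elements subclass dict, list and str). The names collect_accessions and sample are made up for illustration, and sample is a hand-built stand-in for the nesting shown earlier, not real Entrez output.

```python
def collect_accessions(node, prefix="NP_"):
    """Recursively gather Gene-commentary_accession values starting with prefix."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Gene-commentary_accession" and str(value).startswith(prefix):
                found.append(str(value))
            else:
                # Keep descending; plain strings fall through and yield nothing.
                found.extend(collect_accessions(value, prefix))
    elif isinstance(node, (list, tuple)):
        for item in node:
            found.extend(collect_accessions(item, prefix))
    return found


# Hand-built stand-in mimicking the nesting above (assumption, not real output).
sample = {
    "Entrezgene_comments": [
        {"Gene-commentary_comment": [
            {"Gene-commentary_products": [
                {"Gene-commentary_accession": "NM_001012727",
                 "Gene-commentary_products": [
                     {"Gene-commentary_accession": "NP_001012745"}
                 ]}
            ]}
        ]}
    ]
}

protein_ids = collect_accessions(sample)
print(protein_ids)
```

On a real record the same accession may appear in several places, so you will probably want to de-duplicate the result before fetching sequences.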

An alternative approach is to parse a gene_table. Make the same request as before, but ask for a gene_table instead of XML:

handle = Entrez.efetch(db="gene",
                       id="10555",
                       rettype="gene_table",
                       retmode="text")

In the gene_table you'll find lines of the form:

mRNA transcript variant 2 NM_001012727.1
protein isoform b precursor NP_001012745.1
Exon table for  mRNA  NM_001012727.1 and protein NP_001012745.1

You can extract the IDs from these lines.
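As a rough sketch of pulling the IDs out of that text, a regular expression over the gene_table output is usually enough. This assumes you have already read the response into a string (for example with handle.read()) and that you only care about versioned NM_ and NP_ accessions. The function name extract_ids is hypothetical, and sample simply reuses the three lines quoted above.

```python
import re

# Versioned RefSeq mRNA (NM_) and protein (NP_) accessions, e.g. NM_001012727.1
ACCESSION = re.compile(r"\b(N[MP]_\d+\.\d+)\b")


def extract_ids(gene_table_text):
    """Return (mrna_ids, protein_ids), de-duplicated, in order of first appearance."""
    mrna, protein = [], []
    for acc in ACCESSION.findall(gene_table_text):
        target = mrna if acc.startswith("NM_") else protein
        if acc not in target:
            target.append(acc)
    return mrna, protein


# The three example lines from the gene_table quoted above.
sample = """mRNA transcript variant 2 NM_001012727.1
protein isoform b precursor NP_001012745.1
Exon table for  mRNA  NM_001012727.1 and protein NP_001012745.1
"""

mrna_ids, protein_ids = extract_ids(sample)
print(mrna_ids, protein_ids)
```

Other RefSeq prefixes (XM_, XP_, NR_ and so on) are deliberately not matched here; widen the pattern if your gene has predicted or non-coding transcripts.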

@user2764, if you found the answer useful, accept it. It will get marked as "answered", and both of us get a bunch of saucy internet fake points. — xbello, Aug 06 '14 at 14:30
