I am trying to get a protein sequence from NCBI from its GI (GenInfo Identifier) number, using Biopython's Entrez.efetch()
function:
proteina = Entrez.efetch(db="protein", id=gi, rettype="gb", retmode="xml")
I then read the data using:
proteinaXML = Entrez.read(proteina)
I can print the results, but I don't know how to extract just the protein sequence.
I can find the sequence manually once the result is displayed, or by indexing into the XML tree:
proteinaXML[0]["GBSeq_feature-table"][2]["GBFeature_quals"][6]['GBQualifier_value']
However, the XML tree can differ depending on the GI of the protein submitted, which makes it difficult to automate this process robustly.
My question: Is it possible to retrieve only the protein sequence, and not the entire XML tree? Or alternatively: How can I extract the protein sequence from the XML file, given that the structure of XML files can differ from protein to protein?
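To make the second option concrete, here is a minimal sketch of a lookup by key name rather than by position. It assumes the parsed GenBank XML record exposes the sequence under the top-level key "GBSeq_sequence", and falls back to searching the feature qualifiers for one named "translation". The function names extract_protein_sequence and fetch_protein_sequence and the email argument are hypothetical, introduced only for illustration.

```python
def extract_protein_sequence(record):
    """Return the sequence from one parsed GenBank XML record, or None.

    `record` is one element of the list returned by Entrez.read(), which
    behaves like a dict. Keys are looked up by name, so the position of a
    feature or qualifier in the tree does not matter.
    """
    # Assumption: the record carries the sequence at the top level.
    seq = record.get("GBSeq_sequence")
    if seq:
        return seq.upper()

    # Fallback: scan every feature for a qualifier named "translation".
    for feature in record.get("GBSeq_feature-table", []):
        for qual in feature.get("GBFeature_quals", []):
            if qual.get("GBQualifier_name") == "translation":
                value = qual.get("GBQualifier_value")
                if value:
                    return value.upper()
    return None


def fetch_protein_sequence(gi, email):
    """Fetch one protein record by GI and return its sequence (needs network)."""
    from Bio import Entrez  # imported here so the helper above has no dependencies

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.efetch(db="protein", id=gi, rettype="gb", retmode="xml")
    try:
        records = Entrez.read(handle)
    finally:
        handle.close()
    return extract_protein_sequence(records[0])
```

For the first option, requesting rettype="fasta" with retmode="text" and parsing the handle with Bio.SeqIO.read(handle, "fasta") may avoid the XML tree altogether, since the FASTA output contains only a header line and the sequence.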
Thanks