NCBI protein database: how to get protein sequences from a specific BioProject (Python script)

Question

I am trying to retrieve protein-coding sequences from the NCBI database for specific BioProjects. This can be done, after a fashion, in a web browser: find the BioProject you are interested in and click on the associated proteins, e.g. http://www.ncbi.nlm.nih.gov/genome/proteins/994?project_id=207383, which lists all the proteins for BioProject "207383" and Genome "994". I would like to get those protein sequences automatically using Python.

To do that I used the NCBI "E-utilities", mainly "elink.fcgi?", which returns all the UIDs in one database (say "Protein") that are linked to a specific UID in another database (say a BioProject UID). Here is my Entrez URL request:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=bioproject&linkname=bioproject_protein&id=207383
This gives me a list of Protein UIDs, which is exactly what I need for my next request, to the "efetch.fcgi?" E-utility. That request should then return everything I need.
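For readers who want to reproduce the two-step workflow described above, here is a minimal sketch using only the Python standard library. The function names and the batch size are my own choices, not anything NCBI prescribes. I also use the https form of the E-utilities base URL and pass db=protein explicitly, which the URL in the question does not. The parser assumes the usual eLink XML layout (LinkSet / LinkSetDb / Link / Id).

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Assumption: https base URL for the E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def elink_url(bioproject_uid):
    """Build the elink request from the question (db=protein added explicitly)."""
    params = {
        "dbfrom": "bioproject",
        "db": "protein",
        "linkname": "bioproject_protein",
        "id": str(bioproject_uid),
    }
    return EUTILS + "elink.fcgi?" + urlencode(params)


def parse_linked_ids(elink_xml):
    """Return the linked UIDs, skipping the query UID that sits under IdList."""
    root = ET.fromstring(elink_xml)
    return [el.text for el in root.findall("./LinkSet/LinkSetDb/Link/Id")]


def efetch_url(protein_uids):
    """Build an efetch request returning plain-text FASTA for the given UIDs."""
    params = {
        "db": "protein",
        "id": ",".join(protein_uids),
        "rettype": "fasta",
        "retmode": "text",
    }
    return EUTILS + "efetch.fcgi?" + urlencode(params)


def fetch_proteins(bioproject_uid, batch_size=200):
    """Hypothetical driver: elink first, then efetch in batches (needs network)."""
    with urlopen(elink_url(bioproject_uid)) as resp:
        uids = parse_linked_ids(resp.read())
    chunks = []
    for start in range(0, len(uids), batch_size):
        with urlopen(efetch_url(uids[start:start + batch_size])) as resp:
            chunks.append(resp.read().decode())
    return "".join(chunks)
```

Calling fetch_proteins(207383) would run both requests. Batching keeps each efetch URL short; the value 200 is arbitrary.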

So far so good, and the requests work, BUT the number of Protein UIDs returned by my "elink.fcgi?" request is not the same as the number of proteins displayed by a manual, browser-based search. Worse, when I looked into where the difference comes from, I found missing sequences as well as sequences from higher taxa (which are not linked to the BioProject in any way).

Here is an example: the first link in this post displays 4014 sequences, whereas the Python request returns 3957 Protein UIDs.

I tried some other approaches, such as getting all the Protein UIDs linked to a Taxonomy UID. This usually gives more sequences than wanted, since the taxon covers several different BioProjects (it also gives some duplicates with different names and the same FASTA sequence).
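The duplicates mentioned above (same sequence, different names) can at least be collapsed after download. The sketch below is a plain-Python illustration, not part of any NCBI tooling: the function names are hypothetical, and it keeps the first header seen for each distinct sequence. It does not solve the over-retrieval problem itself.

```python
def parse_fasta(text):
    """Split FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, seq_lines = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(seq_lines)))
            header, seq_lines = line[1:], []
        elif header is not None:
            seq_lines.append(line)
    if header is not None:
        records.append((header, "".join(seq_lines)))
    return records


def dedupe_by_sequence(records):
    """Keep only the first record for each distinct sequence string."""
    seen = set()
    unique = []
    for header, seq in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((header, seq))
    return unique
```

Whether dropping identical sequences is appropriate depends on the analysis; identical proteins from different strains may be meaningful in some contexts.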

Is there a way to do this that might actually work?

score 2 · Answer 1 · answered Mar 22 '14 at 00:36

I also find working with NCBI extremely frustrating. I am amazed that such a data source doesn't provide a clean-cut way to download data. Instead, it offers some terrible cross-linking and leaves users to figure the whole thing out themselves.

My solution is from this post

How to Download Bacterial Genomes Using the Entrez API

Be sure to change the db to "nuccore" and the rettype to "fasta_cds_aa". Also check the taxonomy ID of the downloaded FASTA file to make sure it is exactly the strain you asked for (this last one messed me up big time; a hard-learned lesson).
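As a rough illustration of the two settings named above, the sketch below builds an efetch URL with db=nuccore and rettype=fasta_cds_aa. For the taxonomy check it adds one possible approach, which is my own assumption rather than something from the linked post: ask esummary for the same record and read its TaxId field. That part assumes the default DocSum XML layout, where fields appear as Item elements with a Name attribute. The function names and the https base URL are also my choices.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumption: https base URL for the E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def cds_protein_url(nuccore_id):
    """efetch URL for the translated CDS features of one nuccore record."""
    params = {
        "db": "nuccore",
        "id": nuccore_id,
        "rettype": "fasta_cds_aa",
        "retmode": "text",
    }
    return EUTILS + "efetch.fcgi?" + urlencode(params)


def esummary_url(nuccore_id):
    """esummary URL for the same record, used here only to look up its taxon."""
    return EUTILS + "esummary.fcgi?" + urlencode({"db": "nuccore", "id": nuccore_id})


def taxid_from_docsum(docsum_xml):
    """Return the TaxId as an int, or None if the field is absent.

    Assumes Item elements of the form <Item Name="TaxId" ...>N</Item>.
    """
    root = ET.fromstring(docsum_xml)
    item = root.find(".//Item[@Name='TaxId']")
    if item is None or not item.text:
        return None
    return int(item.text)
```

Comparing the returned TaxId with the taxon you expect, before trusting the downloaded file, is one way to catch the wrong-strain problem described in this answer.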
