I am trying to retrieve coding protein sequences from the NCBI database for specific BioProjects. This can be done, more or less, in a web browser: you find the BioProject you are interested in and click on the associated proteins, for instance http://www.ncbi.nlm.nih.gov/genome/proteins/994?project_id=207383, which lists all the proteins from BioProject "207383" for Genome "994". I would like to get those protein sequences automatically using Python.
To do that I used the NCBI "E-utilities", mainly "elink.fcgi?", which returns all the UIDs in one database (say "Protein") that are linked to a specific UID in another database (say a BioProject UID). Here is my Entrez URL request:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=bioproject&linkname=bioproject_protein&id=207383
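A minimal sketch of how a request like this could be built and its XML reply parsed in Python, using only the standard library. The function names are hypothetical, and the parser assumes the usual elink reply layout in which linked UIDs sit under LinkSetDb/Link/Id.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Base URL as written in the post.
EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_elink_url(bioproject_uid):
    """Build the elink request linking a BioProject UID to Protein UIDs."""
    params = {
        "dbfrom": "bioproject",
        "linkname": "bioproject_protein",
        "id": str(bioproject_uid),
    }
    return EUTILS_BASE + "elink.fcgi?" + urllib.parse.urlencode(params)


def parse_linked_ids(xml_text):
    """Return the linked UIDs found under LinkSetDb/Link/Id (assumed layout)."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall(".//LinkSetDb/Link/Id")]


def fetch_protein_uids(bioproject_uid):
    """Hypothetical helper: send the request and parse the reply."""
    with urllib.request.urlopen(build_elink_url(bioproject_uid)) as response:
        return parse_linked_ids(response.read().decode("utf-8"))
```

Calling fetch_protein_uids(207383) would send the request shown above; the UIDs of the queried BioProject itself (under IdList) are deliberately not included in the result.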
This gives me a list of Protein UIDs, which is what I need for my next request, to the "efetch.fcgi?" E-utility. That request would then let me retrieve everything I need.
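The efetch step could look roughly like the sketch below, which only builds the request URLs. It assumes FASTA output is wanted (rettype=fasta, retmode=text) and splits the UID list into batches so that no single URL gets too long; the batch size of 200 and the function names are assumptions, not values from NCBI.

```python
import urllib.parse

# Base URL as written in the post.
EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
BATCH_SIZE = 200  # assumed value, chosen only to keep each URL short


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_efetch_url(protein_uids):
    """Build one efetch request returning FASTA text for the given UIDs."""
    params = {
        "db": "protein",
        "id": ",".join(protein_uids),
        "rettype": "fasta",
        "retmode": "text",
    }
    # Keep commas unescaped so the UID list stays readable in the URL.
    return EUTILS_BASE + "efetch.fcgi?" + urllib.parse.urlencode(params, safe=",")


def build_efetch_urls(protein_uids, batch_size=BATCH_SIZE):
    """Build one efetch URL per batch of UIDs."""
    return [build_efetch_url(batch) for batch in chunked(list(protein_uids), batch_size)]
```

Each URL can then be opened with urllib.request.urlopen and the FASTA text of the batches concatenated.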
So far this works fine, BUT the number of Protein UIDs I get from my "elink.fcgi?" request isn't the same as the number of proteins displayed by a manual, browser-based search. Worse, when I looked into where the difference comes from, I found missing sequences as well as sequences from higher taxa (which are not linked in any way to the BioProject).
Here is an example: the first link of this post displays 4014 sequences, whereas the Python request returns 3957 Protein UIDs.
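The net difference here is 4014 - 3957 = 57, but since some sequences are missing and others are unexpected, the two lists can differ by more than 57 entries in total. A small sketch for listing both kinds of mismatch, assuming both lists have been reduced to the same kind of identifier (the function and key names are hypothetical):

```python
def compare_id_sets(web_ids, elink_ids):
    """Report identifiers present in only one of the two lists."""
    web, elink = set(web_ids), set(elink_ids)
    return {
        "missing_from_elink": sorted(web - elink),  # shown on the web page only
        "extra_in_elink": sorted(elink - web),      # returned by elink only
    }


# Toy identifiers, not real NCBI UIDs.
report = compare_id_sets(["A", "B", "C"], ["B", "C", "D"])
print(report)
```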
I tried some other approaches, such as getting all the Protein UIDs linked to a taxonomy UID. This usually returns more sequences than wanted, since the taxon is covered by different BioProjects (it also returns some duplicates with different names and the same FASTA sequence).
Is there a way to do this that actually works?
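The duplicates can at least be spotted after download by grouping records on their sequence. This is a sketch in plain Python, assuming the FASTA text is already in memory; the function names are hypothetical, and it does not solve the problem of restricting the results to one BioProject.

```python
from collections import defaultdict


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def group_by_sequence(records):
    """Map each distinct sequence to the headers that carry it."""
    groups = defaultdict(list)
    for header, sequence in records:
        groups[sequence.upper()].append(header)
    return dict(groups)
```

Any sequence whose header list has more than one entry is a duplicate under a different name.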