Differences between Biopython NCBI/Entrez esummary output and R package output

Question

I am new to accessing Entrez through Biopython and a couple of R packages (rentrez and reutil). When accessing the 'nuccore' database with esummary, the output fields returned by Biopython are different than that returned by the R packages.

Python:

handle = Entrez.esearch(db='nuccore', term='183844[GPRJ]', retmax=75000)
record = Entrez.read(handle)
id_list = record["IdList"]
search_results = Entrez.read(Entrez.epost("nuccore", id=",".join(id_list), restart=1, retmax=10000))
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
handle1 = Entrez.esummary(db="nuccore", query_key=query_key, WebEnv=webenv)
record1 = Entrez.read(handle1)

The fields returned by Biopython are:

['AccessionVersion','Caption','Comment','CreateDate','Extra','Flags','Gi','Id', 'Item','Length','ReplacedBy','Status','TaxId','Title','UpdateDate']

R (reutil package):

trak <- esearch('183844[GPRJ]', "nuccore", usehistory=TRUE, retmax = 70000)
query_key <- 1
web_env <- "NCID_1_224566406_130.14.18.34_9001_1496371219_1582367639_0MetA0_S_MegaStore_F_1"
esum <- esummary(db="nuccore", querykey = query_key, webenv = web_env, retstart = 1, retmax = 10000)
gtrkr <- content(esum, "parsed")

While the fields returned by R packages reutil and rentrez are: esummary result with 31 items:

['uid', 'caption', 'title', 'extra', 'gi', 'createdate', 'updatedate', 'flags', 'taxid', 'slen', 'biomol', 'moltype', 'topology', 'sourcedb', 'segsetsize', 'projectid', 'genome', 'subtype', 'subname', 'assemblygi', 'assemblyacc', 'tech', 'completeness', 'geneticcode', 'strand', 'organism', 'strain', 'biosample', 'statistics', 'properties', 'oslt']

Thanks in advance.

My guess is that `reutil` returns all item names even if they have no value, but the Python code is dropping items without a value. They both access the same underlying API data, so the difference must be in processing the result. — neilfws, Jun 02 '17 at 05:48
No, there are data in the fields returned by reutil (rather important data at that), which is why it was frustrating to not have these metadata after going through the Python route. — vethno, Jun 02 '17 at 05:51
What I mean is that whilst the R package returns 31 items per `id`, some of them are empty strings. I used the `rentrez` package with your query and it had the same behaviour: every `id` had those 31 fields but the values are empty string for some cases. My suspicion is that the Python code just drops those ones. — neilfws, Jun 02 '17 at 13:27
Understood your point, but it seems like a major flaw when many are not empty strings and they are dropped by biopython, if that's the case. It's more than that because entire fields that have data in the R pull are missing with the Python. — vethno, Jun 02 '17 at 22:28
Got it, thanks. I'll have to check to see if there's a way in the call to include fields with some empty strings. — vethno, Jun 06 '17 at 22:36
You Biopython example is incomplete - it needs ``from Bio import Entrez`` and to explain how you got to ``['AccessionVersion','Caption','Comment','CreateDate','Extra','Flags','Gi','Id', 'Item','Length','ReplacedBy','Status','TaxId','Title','UpdateDate']`` — Peter Cock, Jul 13 '17 at 14:58
This issue was opened by David here: https://github.com/biopython/biopython/issues/1389 — Michiel, Sep 13 '17 at 23:40

david w · Answer 1 · 2017-09-10T22:34:37.980

Coming to this late, but as a past contributor to biopython and maintainer of rentrez I feel I need to explain what is going on here.

Biopython is accessing "version 1.0" esummary records by default, and the R packages are fetching "version 2.0" records. There is a brief discussion about the differences between these records in the rentrez help page:

The NCBI offer two distinct formats for summary documents. Version 1.0 is a relatively limited summary of a database record based on a shared Document Type Definition. Version 1.0 summaries are only available as XML and are not available for some newer databases Version 2.0 summaries generally contain more information about a given record, but each database has its own distinct format. 2.0 summaries are available for records in all databases and as JSON and XML files. As of version 0.4, rentrez fetches version 2.0 summaries by default and uses JSON as the exchange format (as JSON object can be more easily converted into native R types). Existing scripts which relied on the structure and naming of the "Version 1.0" summary files can be updated by setting the new ‘version’ argument to "1.0".

And just to demonstrate changing this argument reproduces the results from Biopython.

> eg_gene <- entrez_search(db="nuccore", term='183844[GPRJ]', retmax=1)
> entrez_summary(db="nuccore", id=eg_gene$ids, version="1.0")
esummary result with 13 items:
 [1] Caption          Title            Extra            Gi              
 [5] CreateDate       UpdateDate       Flags            TaxId           
 [9] Length           Status           ReplacedBy       Comment         
[13] AccessionVersion
> entrez_summary(db="nuccore", id=eg_gene$ids)
esummary result with 31 items:
 [1] uid          caption      title        extra        gi          
 [6] createdate   updatedate   flags        taxid        slen        
[11] biomol       moltype      topology     sourcedb     segsetsize  
[16] projectid    genome       subtype      subname      assemblygi  
[21] assemblyacc  tech         completeness geneticcode  strand      
[26] organism     strain       biosample    statistics   properties  
[31] oslt

Edit -- getting version 2.0 records with Biopython

handle = Entrez.esearch(db="nuccore", term="183844[GPRJ]", retmax=1)
record = Entrez.read(handle) 
handle_two = Entrez.esummary(db="nuccore", id=record["IdList"][0], version="2.0")
Entrez.read(handle_two, validate=False)

.

{'DocumentSummarySet': ListElement([ListElement(['NPMJ00000000', 'Salmonella enterica subsp. enterica serovar Johannesburg strain CFSAN059880, whole genome shotgun sequencing project', 'gi|1235597280|gb|NPMJ00000000.1|NPMJ01000000', '1235597280', '2017/08/22', '2017/08/22', '0', '913076', '48', 'genomic', 'dna', 'linear', 'insd', '0', '186035', '', 'strain|serovar|host|sub_species|country|isolation_source|collection_date|collected_by', 'CFSAN059880|Johannesburg|Bos taurus|enterica|Nigeria|cattle stool|2012|University of Ibadan', '0', '', 'wgs', '', '11', '', 'Salmonella enterica subsp. enterica serovar Johannesburg', 'CFSAN059880', [StringElement('', attributes={'count': '1', 'type': 'all'}), StringElement('', attributes={'count': '3500', 'type': 'blob_size'}), StringElement('', attributes={'count': '1', 'type': 'org'}), StringElement('', attributes={'count': '2', 'type': 'pub'}), StringElement('', attributes={'count': '1', 'subtype': 'unpublished', 'type': 'pub'}), StringElement('', attributes={'count': '1', 'source': 'all', 'type': 'all'}), StringElement('', attributes={'count': '3500', 'source': 'all', 'type': 'blob_size'}), StringElement('', attributes={'count': '1', 'source': 'all', 'type': 'org'}), StringElement('', attributes={'count': '2', 'source': 'all', 'type': 'pub'})], StringElement('1', attributes={'master': '1', 'na': '1'}), StringElement('NPMJ00000000.1', attributes={'indexed': 'yes'}), 'NPMJ00000000.1'], attributes={'uid': '1235597280'})], attributes={'status': 'OK'})}

Thanks David. Now, how would we get Biopython to fetch the version 2 summary XML? — Peter Cock, Sep 08 '17 at 11:19
Hi Peter, it is possible to add `version='2.0'` to a call to `Entrez.summary`. That will fetch the XML file but Biopython won't have every DTD to read the results (yet). Have added a small example to the answer, can open an issue at the Biopython repo if you want to go further into detail? — david w, Sep 10 '17 at 22:28
Thank you. It sounds like the Biopython documentation would benefit from this information - if you don't mind opening an issue or pull request, that'd be great - cheers! — Peter Cock, Sep 11 '17 at 12:23

score 1 · Answer 2 · answered Jul 13 '17 at 15:10

To explain the Biopython example:

from Bio import Entrez
handle = Entrez.esearch(db='nuccore', term='183844[GPRJ]', retmax=75000)
record = Entrez.read(handle)
id_list = record["IdList"]
search_results = Entrez.read(Entrez.epost("nuccore", id=",".join(id_list), restart=1, retmax=10000))
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
handle1 = Entrez.esummary(db="nuccore", query_key=query_key, WebEnv=webenv)
record1 = Entrez.read(handle1)

Now this should confirm their are 1000 entries (matching retmax), and each has 15 fields:

print(len(record1))
for entry in record1:
    assert len(entry) == 15
print(record1[0])

That should give:

1000
{'Item': [], 'Id': '1102582672', 'Caption': 'MEKF00000000', 'Title': 'Salmonella enterica subsp. enterica serovar Sandiego strain CFSAN039537, whole genome shotgun sequencing project', 'Extra': 'gi|1102582672|gb|MEKF00000000.1|MEKF01000000[1102582672]', 'Gi': 1102582672, 'CreateDate': '2016/11/14', 'UpdateDate': '2017/07/11', 'Flags': 0, 'TaxId': 0, 'Length': 93, 'Status': 'live', 'ReplacedBy': '', 'Comment': '  ', 'AccessionVersion': 'MEKF00000000.1'}

As an aside, I'm not sure what the 'Item' empty list is from.

Let's check the actual raw XML for the first record using retmax=1

from Bio import Entrez
handle = Entrez.esearch(db='nuccore', term='183844[GPRJ]', retmax=1)
record = Entrez.read(handle)
id_list = record["IdList"]
search_results = Entrez.read(Entrez.epost("nuccore", id=",".join(id_list), restart=1, retmax=10000))
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
handle1 = Entrez.esummary(db="nuccore", query_key=query_key, WebEnv=webenv)
print(handle1.read())

This gives:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD esummary v1 20041029//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20041029/esummary-v1.dtd">
<eSummaryResult>
<DocSum>
    <Id>1102582672</Id>
    <Item Name="Caption" Type="String">MEKF00000000</Item>
    <Item Name="Title" Type="String">Salmonella enterica subsp. enterica serovar Sandiego strain CFSAN039537, whole genome shotgun sequencing project</Item>
    <Item Name="Extra" Type="String">gi|1102582672|gb|MEKF00000000.1|MEKF01000000[1102582672]</Item>
    <Item Name="Gi" Type="Integer">1102582672</Item>
    <Item Name="CreateDate" Type="String">2016/11/14</Item>
    <Item Name="UpdateDate" Type="String">2017/07/11</Item>
    <Item Name="Flags" Type="Integer">0</Item>
    <Item Name="TaxId" Type="Integer">0</Item>
    <Item Name="Length" Type="Integer">93</Item>
    <Item Name="Status" Type="String">live</Item>
    <Item Name="ReplacedBy" Type="String"></Item>
    <Item Name="Comment" Type="String"><![CDATA[  ]]></Item>
    <Item Name="AccessionVersion" Type="String">MEKF00000000.1</Item>
</DocSum>

</eSummaryResult>

i.e. The exact same fields Biopython's Entrez parser is giving you as keys (plus the Id and that Item empty list which puzzled me above).

Are you sure you are comparing like with like here?

Could you give a specific example accession where your R solution has extra data?

Differences between Biopython NCBI/Entrez esummary output and R package output

2 Answers2

Edit -- getting version 2.0 records with Biopython