How to use entrezpy and Biopython Entrez libraries to access ClinVar data from the genomic position of a variant

Question

[Disclaimer: I posted this question on Biostars 3 weeks ago and it has no answers yet. I would really like to get some ideas and discussion towards a solution, so I am also posting it here. Biostars post link: https://www.biostars.org/p/447413/]

For one of my PhD projects, I would like to retrieve all variants in the ClinVar database that lie at the same genomic position as the variant in each row of an input GSvar file. The language constraint is Python.

So far I have used the entrezpy module entrezpy.esearch.esearcher. The entrezpy documentation is at: https://entrezpy.readthedocs.io/en/master/

From the entrezpy docs, I followed this guide to retrieve UIDs using the genomic position of a variant: https://entrezpy.readthedocs.io/en/master/tutorials/esearch/esearch_uids.html. In code:

import entrezpy.esearch.esearcher

# first get UIDs for ClinVar records at the same position
# credits: https://entrezpy.readthedocs.io/en/master/tutorials/esearch/esearch_uids.html
chrom = variants["chr"].split("chr")[1]  # e.g. "chr7" -> "7"; named chrom to avoid shadowing the built-in chr()
start, end = str(variants["start"]), str(variants["end"])

es = entrezpy.esearch.esearcher.Esearcher('esearcher', self.entrez_email)
genomic_pos = chrom + "[chr]" + " AND " + start + ":" + end  # + "[chrpos37]"
entrez_query = es.inquire(
    {'db': 'clinvar',
     'term': genomic_pos,
     'retmax': 100000,
     'retstart': 0,
     'rettype': 'uilist'})  # 'usehistory': False
entrez_uids = entrez_query.get_result().uids
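The term construction above can be isolated in a small helper, which makes it easier to test without contacting Entrez. This is only a sketch: `build_clinvar_term` and its `pos_field` parameter are hypothetical names, and the optional field tag simply mirrors the commented-out `[chrpos37]` from the snippet above.

```python
def build_clinvar_term(chrom, start, end, pos_field=None):
    """Build a search term such as '7[chr] AND 100:200' for one variant."""
    chrom = str(chrom)
    # Accept both "chr7" and "7"; only a leading "chr" prefix is stripped.
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    pos_range = f"{int(start)}:{int(end)}"
    # Optionally tag the range with a position field, e.g. "chrpos37".
    if pos_field:
        pos_range += f"[{pos_field}]"
    return f"{chrom}[chr] AND {pos_range}"


print(build_clinvar_term("chr7", 140453136, 140453136))
print(build_clinvar_term("chr7", 140453136, 140453136, pos_field="chrpos37"))
```

Unlike `split("chr")[1]`, this version does not raise an error when the chromosome name has no "chr" prefix.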

Then I used Entrez from Biopython to fetch the available ClinVar records:

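import xml.etree.ElementTree as ET  # ET is assumed to be the standard-library ElementTree
from Bio import Entrez

# process each VariationArchive of each UID
handle = Entrez.efetch(db='clinvar', id=current_entrez_uids, rettype='vcv')
clinvar_records = {}
tree = ET.parse(handle)
root = tree.getroot()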
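To illustrate how the parsed tree might be turned into the `clinvar_records` dictionary, here is a minimal sketch that runs on a hand-written XML string instead of a live efetch response. The element and attribute names (`ClinVarResult-Set`, `VariationArchive`, `Accession`, `VariationName`) are assumptions about the VCV layout and should be checked against a real response; `index_variation_archives` is a hypothetical helper.

```python
import io
import xml.etree.ElementTree as ET

# Hand-written stand-in for an efetch response (assumed layout, placeholder values).
SAMPLE_XML = """<ClinVarResult-Set>
  <VariationArchive VariationID="1" Accession="VCV000000001" VariationName="example variant A"/>
  <VariationArchive VariationID="2" Accession="VCV000000002" VariationName="example variant B"/>
</ClinVarResult-Set>"""


def index_variation_archives(handle):
    """Map each VariationArchive accession to its XML element."""
    root = ET.parse(handle).getroot()
    return {va.get("Accession"): va for va in root.iter("VariationArchive")}


clinvar_records = index_variation_archives(io.StringIO(SAMPLE_XML))
for accession, element in clinvar_records.items():
    print(accession, element.get("VariationName"))
```

With a real response, the `handle` returned by `Entrez.efetch` would take the place of the `io.StringIO` object.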

This approach works. However, it has two main drawbacks:

1. entrezpy fills up my log file by recording every interaction with Entrez, which makes the log too big to be read by our hospital collaborator, a variant curator.
2. The entrezpy call entrez_query.get_result().uids returns all UIDs retrieved so far across all requests (one request per variant in the GSvar file), so the retrieval is space-inefficient: the entrez_uids list grows quickly as I process all variants of a GSvar file. The simple workaround I have implemented is to check which UIDs are new in the current request and pass only those to Entrez.efetch(). However, I still need to keep all UIDs seen for previous variants in order to know which ones are new. I do this in code as follows:
```
# first snippet's first lines go here
entrez_uids = entrez_query.get_result().uids
current_entrez_uids = [uid for uid in entrez_uids if uid not in self.all_entrez_uids_gsvar_file]
self.all_entrez_uids_gsvar_file += current_entrez_uids
```
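The list-based membership test above scans the whole list for every UID. A minimal sketch of the same bookkeeping with a set is shown below; `take_new_uids` is a hypothetical helper, and the integer UIDs are made-up placeholders. It does not remove the need to remember seen UIDs, but each lookup becomes constant time on average.

```python
def take_new_uids(uids, seen):
    """Return the UIDs not yet in `seen` (input order kept) and add them to `seen`."""
    new_uids = []
    for uid in uids:
        if uid not in seen:
            seen.add(uid)  # also removes duplicates within a single batch
            new_uids.append(uid)
    return new_uids


seen_uids = set()
# The second call simulates a result list that still contains the earlier UIDs.
first = take_new_uids([101, 102, 103], seen_uids)
second = take_new_uids([101, 102, 103, 204, 205], seen_uids)
print(first, second)
```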

Does anyone have suggestions on how to address these two drawbacks?
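For the first drawback, one direction that might be worth trying is raising the log threshold for the library through Python's standard `logging` module. The sketch below rests on unverified assumptions: that entrezpy emits its messages through `logging`, that its loggers are named under the `entrezpy` prefix, and that they do not set their own levels. `quiet_library_logging` is a hypothetical helper.

```python
import logging


def quiet_library_logging(prefix="entrezpy", level=logging.WARNING):
    """Raise the threshold of the logger named `prefix`.

    Child loggers (e.g. "entrezpy.esearch") that have no level of their
    own inherit this threshold. The prefix is an assumption about how
    the library names its loggers.
    """
    logger = logging.getLogger(prefix)
    logger.setLevel(level)
    return logger


quiet_library_logging()
```

If the library assigns explicit levels to its own loggers, this will have no effect on them, and a `logging.Filter` attached to the log file handler may be needed instead.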

I know it has been a long time, but could my custom implementation, https://github.com/krassowski/easy-entrez, be of any help here? I knew of the problem of entrezpy accumulating history, and it was one of the reasons I went ahead and created easy-entrez (the other being the entrezpy authors not responding to bug reports at that time). — krassowski, Apr 04 '21 at 21:48
Thank you for coding up such a solution! As I needed a time-efficient way to extract variants from ClinVar in an everyday diagnostic setting, I downloaded the ClinVar VCF file and then used https://github.com/jamescasbon/PyVCF to extract variants of interest using the index file. — Damianos P. Melidis, May 06 '21 at 10:09
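The VCF-based approach from the last comment could look roughly like the sketch below. It assumes a bgzipped ClinVar VCF with a tabix index next to it, that pysam is installed (PyVCF needs it for region queries), and that the installed PyVCF version treats `fetch` coordinates as zero-based and half-open, which is worth checking for the version in use. The function names are hypothetical, and the input interval is assumed to be 1-based and inclusive.

```python
def to_half_open(start, end):
    """Convert a 1-based inclusive interval to a 0-based half-open one."""
    return start - 1, end


def fetch_clinvar_records(vcf_path, chrom, start, end):
    """Yield VCF records overlapping chrom:start-end (1-based, inclusive)."""
    import vcf  # PyVCF, imported lazily so the helper above works without it

    reader = vcf.Reader(filename=vcf_path)
    fetch_start, fetch_end = to_half_open(start, end)
    yield from reader.fetch(chrom, fetch_start, fetch_end)


print(to_half_open(140453136, 140453136))
```

The chromosome name passed to `fetch_clinvar_records` has to match the naming used in the VCF file, so a "chr" prefix may need to be stripped first.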
