Finding number of articles for a disease using PubMed (python)

Question

I am looking for an efficient way to ask Entrez (Biopython) for the number of PubMed articles associated with a given indication/condition. All I have is a list of indications written out in full.

Now, I have worked out a way, but it is quite imprecise: it does not account for biases that come from how the disease is described or spelled. Ideally, I would like to retrieve the MeSH term associated with a condition and then count the articles associated with that MeSH term.

Thank you a lot,

Federico

EDIT 1:

Please add your code, otherwise you probably won't get an answer.

Yes, sorry:

from Bio import Entrez

query = "aneurysm"
handle1 = Entrez.esearch(db="mesh", term=query)
record1 = Entrez.read(handle1)
handle1.close()

Basically, the code above starts from a disease name and tries to look up the MeSH codes of that disease. The problem is that this approach is very unstable and error-prone, since, for instance, writing "diabetes", "diabetes type II" or "diabetes type 2" produces slightly different results.
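
One way to reduce the dependence on spelling, sketched here under assumptions rather than as a definitive fix, is to tag the term with PubMed's `[MeSH Terms]` field and to inspect the `QueryTranslation` that ESearch returns, which shows how PubMed interpreted the query. The helper names `mesh_term_query` and `count_articles` are hypothetical, and the email address is a placeholder.

```python
def mesh_term_query(heading):
    """Build a PubMed query restricted to a MeSH heading.

    Hypothetical helper: quotes the heading and appends the [MeSH Terms] field tag.
    """
    cleaned = " ".join(heading.split())  # collapse stray whitespace
    return f'"{cleaned}"[MeSH Terms]'


def count_articles(term, email="you@example.com"):
    """Return (count, query_translation) for a PubMed query. Needs network access."""
    # Imported here so the pure helper above can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # placeholder; NCBI asks for a real contact address
    handle = Entrez.esearch(db="pubmed", term=term, retmax=0)  # retmax=0: count only
    record = Entrez.read(handle)
    handle.close()
    return int(record["Count"]), record.get("QueryTranslation", "")
```

For example, `count_articles(mesh_term_query("Diabetes Mellitus, Type 2"))` would return the article count together with the translated query, which makes it easier to see whether two spellings of a disease end up on the same heading.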

For these reasons, and since I have ClinicalTrials.gov identifiers (National Clinical Trial numbers, NCT IDs), I tried a more structured approach:

import pandas as pd
from Bio import Entrez

Entrez.email = "mymail@gmail.com"
#search_results = Entrez.read(Entrez.esearch(db="pubmed", term = "NCT00000419[SI]"))
#count = int(search_results["Count"])
#records = count
handle1 = Entrez.esearch(db="pubmed", retmax=10, term="NCT00646048[si]", idtype="acc")
record1 = Entrez.read(handle1)
handle1.close()
int(record1["Count"]) >= 2  # True if at least two articles cite this trial

I typed "NCT00000419[SI]" based on the PubMed section of the article [Linking ClinicalTrials.gov and PubMed to Track Results of Interventional Human Clinical Trials][1].

The two snippets above are of course simple attempts; my final goal is still to retrieve the number of articles associated with an indication. Going through the NCT ID is one way to do that, since apparently NCT records also contain MeSH terms.

Thanks again!

EDIT 2: I have tried something like the following, but again, the indication levels are too broad. I would like to find a more objective way to count the number of articles. The best option, to me, is to use the NCT ID:

df = pd.read_stata("/Users/federiconutarelli/Desktop/First_work/PubMed/indictations_nomatch.dta")

# Assumption: the indications live in a column named "indicationlevel3"
indicationlevel3 = df["indicationlevel3"].tolist()

years = [2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013]
records = {}
for indication in indicationlevel3:
    for year in years:
        records[(indication, year)] = 0
search_results = {}
count = {}

Entrez.email = "mymail@gmail.com"
for indication in indicationlevel3:
    for year in years:
        # Restrict the search to articles with a publication date (pdat) in this year
        handle = Entrez.esearch(db="pubmed", term=indication,
                                mindate=year, maxdate=year, datetype="pdat",
                                usehistory="y")
        search_results[(indication, year)] = Entrez.read(handle)
        handle.close()
        count[(indication, year)] = int(search_results[(indication, year)]["Count"])
        records[(indication, year)] = count[(indication, year)]
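
The nested loop above issues one request per (indication, year) pair, so for long lists it may help to space the requests out; NCBI documents a limit of about three requests per second without an API key, and Biopython may already throttle on its own, so the explicit pause here is a conservative extra. A minimal sketch with hypothetical names (`pubmed_count`, `count_by_year`) and a placeholder email, in which the search function is passed in so the loop can be tried offline:

```python
import itertools
import time


def pubmed_count(term, year, email="you@example.com"):
    """Number of PubMed records matching `term` published in `year`. Needs network access."""
    # Imported here so count_by_year can be exercised without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # placeholder; NCBI asks for a real contact address
    handle = Entrez.esearch(db="pubmed", term=term, retmax=0,
                            mindate=str(year), maxdate=str(year), datetype="pdat")
    record = Entrez.read(handle)
    handle.close()
    return int(record["Count"])


def count_by_year(terms, years, search=pubmed_count, delay=0.34):
    """Return {(term, year): count}, pausing `delay` seconds after each request."""
    counts = {}
    for term, year in itertools.product(terms, years):
        counts[(term, year)] = search(term, year)
        time.sleep(delay)
    return counts
```

Calling `count_by_year(indicationlevel3, years)` would produce the same kind of dictionary as `records` above; passing a stub as `search` allows a dry run without touching the network.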

For the NCT ID approach I have tried:

Entrez.email = "mymail@gmail.com"
id= "NCT00646048[si]"
handle = Entrez.efetch(db="pubmed", id=id, rettype="gb", retmode="xml")
record = Entrez.read(handle)
abstract = record['PubmedArticle'][0]['MedlineCitation']['Article']
abstract

But it does not seem to work.
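
A possible reason, stated as an assumption since no error message is shown: EFetch expects PubMed IDs in `id`, not a search expression such as `NCT00646048[si]`, so the trial number probably has to go through ESearch first. The sketch below does that and then reads the MeSH descriptors from each fetched record. `nct_query`, `mesh_descriptors` and `mesh_for_trial` are hypothetical names, the email is a placeholder, and the record layout is the one Biopython's parser normally produces for PubMed XML.

```python
def nct_query(nct_id):
    """Build a PubMed query for a ClinicalTrials.gov number as secondary source ID."""
    return f"{nct_id.strip().upper()}[si]"


def mesh_descriptors(article):
    """Collect MeSH descriptor names from one parsed PubmedArticle record.

    Articles that are not indexed yet have no MeshHeadingList, hence the .get().
    """
    headings = article["MedlineCitation"].get("MeshHeadingList", [])
    return [str(heading["DescriptorName"]) for heading in headings]


def mesh_for_trial(nct_id, email="you@example.com"):
    """Return {pmid: [descriptor, ...]} for articles citing the trial. Needs network access."""
    # Imported here so the pure helpers above can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # placeholder; NCBI asks for a real contact address

    # Step 1: turn the NCT number into PubMed IDs.
    handle = Entrez.esearch(db="pubmed", term=nct_query(nct_id))
    pmids = Entrez.read(handle)["IdList"]
    handle.close()
    if not pmids:
        return {}

    # Step 2: fetch the full records for those IDs and read their MeSH headings.
    handle = Entrez.efetch(db="pubmed", id=",".join(pmids), retmode="xml")
    articles = Entrez.read(handle)["PubmedArticle"]
    handle.close()
    return {str(a["MedlineCitation"]["PMID"]): mesh_descriptors(a) for a in articles}
```

An empty result from `mesh_for_trial` would mean that no PubMed record lists the trial number as a secondary source ID, which is different from the request itself failing.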



 [1]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3706420/

Please add your code, otherwise you probably won't get an answer. — Stefan, Jul 26 '20 at 08:53
This is an interesting problem, as the writers themselves may try to obscure adverse effects by using different terms. For example, in a drug trial, adverse effects of "heart attack," "myocardial infarction," and "AMI" all map to the same MeSH term. However, the writers might use all three terms to blunt the detrimental aspects of the heart attack. — rajah9, Jul 26 '20 at 11:16
By the way, I think you *should* get different results for "Diabetes" vs "Diabetes Type 2." Did you want your program to search to the broader or the more specific? Of course "Type 2" and "Type II" should give the same results. Interestingly, scholar.google.com gives 2.7M for the former and 2.6M for the latter — rajah9, Jul 26 '20 at 11:21
Hi @rajah9 and thank you for the reply. I would like it to search for the more specific version. In particular, I have a list of level 3 indications (i.e. at ATC level 3) that I uploaded together with some associated NCT IDs. So, for instance, I have Acne as a level 3 indication. For these indications I have a bunch of trials identified by NCT IDs from ClinicalTrials.gov. What I would like to do is take either the NCT ID (better, since it does not have the biases described above) or the level 3 indication and look up the number of articles in PubMed. Thanks again! — Lusian, Jul 26 '20 at 15:33
So in your last code fragment, you say "It does not seem to work." What is not working? Is `NCT00646048` one of the IDs you are looking for? I'm not familiar with this API, so can you tell me what this ID indicates and what you were expecting? — rajah9, Jul 27 '20 at 11:55
@rajah9 NCT ID stands for National Clinical Trial identifier and is taken from ClinicalTrials.gov. As stated in the article I posted and also on Wikipedia, from 2005 onwards one should be able to recover the MeSH term associated with an NCT ID by specifying it as a secondary identifier in the requested URL. Yet, when I try to pass the NCT ID as a secondary identifier, it does not seem to work, in the sense that it does not retrieve the NCT IDs for the years in which I have them in my data. In general, i.e. even without using NCT IDs, is there a way to retrieve the number of articles starting from section C of the MeSH database? — Lusian, Jul 27 '20 at 17:46
