
I need some help extracting affiliation information from PubMed search results in R. I have already successfully extracted affiliation information from the XML of a single PubMed ID. Now I have a search string with multiple terms, and I need to extract the affiliation information for all matching articles, with the hope of then creating a data frame with columns such as PMID, author, country, state, etc.

This is my code so far:

library(easyPubMed)

my_query <- "PubMed search string"  # placeholder for the actual (very long) query
my_entrez_id <- get_pubmed_ids(my_query)
my_abstracts_txt <- fetch_pubmed_data(my_entrez_id, format = "abstract")

The PubMed search string is very long, which is why I haven't included it here. The main aim is to produce a data frame from this search string: a table clearly showing affiliation and other general information from the PubMed articles.

Any help would be greatly appreciated!

Aber

1 Answer


Have you tried the pubmedR package? https://cran.rstudio.com/web/packages/pubmedR/index.html

library(pubmedR)
library(purrr)
library(tidyr)

my_query <- '(((("diabetes mellitus"[MeSH Major Topic]) AND ("english"[Language])) AND (("2020/01/01"[Date - Create] : "3000"[Date - Create]))) AND ("coronavirus"[MeSH Major Topic])'

my_request <- pmApiRequest(query = my_query,
                            limit = 5)

You can use the built-in function my_pm_df <- pmApi2df(my_request), but this will not provide affiliations for all authors.

You can use a combination of pluck() and map() from purrr to extract what you need into a tibble.

auth <- pluck(my_request, "data") %>% {
  tibble(
    pmid = map_chr(., pluck, "MedlineCitation", "PMID", "text"),
    author_list = map(., pluck, "MedlineCitation", "Article", "AuthorList")
  )
  }

All author data is contained in that nested list, in the Author$AffiliationInfo list (note it is a list because one author can have multiple affiliations).

================================================= EDIT based on comments:

First construct your request URLs. Make sure you replace the value of the &email parameter (MYEMAIL@MYDOMAIN.COM) with your own email address:

library(httr)
library(xml2)
library(dplyr)  # for bind_cols(), used in the final step

mypmids <- c("32946812", "32921748", "32921727", "32921708", "32911500", 
             "32894970", "32883566", "32880294", "32873658", "32856805",
             "32856803", "32820143", "32810084", "32809963", "32798472")

my_query <- paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=",
                   mypmids,
                   "&retmode=xml&email=MYEMAIL@MYDOMAIN.COM")

I like to wrap my API requests in safely() to catch any errors, then use map() to loop through the my_query vector. Note that we call Sys.sleep() for 5 seconds after each request to stay within PubMed's rate limit. You can probably cut this down to a second or even less; check the API documentation.

get_safely <- safely(GET)

my_req <- map(my_query, function(z) {
  print(z)
  req <- get_safely(url = z)
  Sys.sleep(5)
  return(req)
})
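With many PMIDs, one request per ID plus a pause adds up quickly. efetch also accepts a comma-separated list of IDs in its id parameter, so one possible alternative is to send the IDs in batches. Below is a minimal sketch in base R, assuming a hypothetical batch_size of 3 and a shortened ID vector. Note that each response would then contain several PubmedArticle nodes, so parsing code that reads only the first article per response would need to loop over them instead.

```r
# First few PMIDs from the vector above, repeated so the sketch is self-contained
mypmids <- c("32946812", "32921748", "32921727", "32921708",
             "32911500", "32894970", "32883566")

batch_size <- 3  # hypothetical value; tune to your needs

# Assign each PMID to a batch number: 1, 1, 1, 2, 2, 2, 3
batch_id <- ceiling(seq_along(mypmids) / batch_size)
batches <- split(mypmids, batch_id)

# Build one efetch URL per batch, with the IDs joined by commas
batch_urls <- vapply(batches, function(ids) {
  paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=",
         paste(ids, collapse = ","),
         "&retmode=xml&email=MYEMAIL@MYDOMAIN.COM")
}, character(1))
```

batch_urls can then be passed to the same map() loop in place of my_query.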

Next we parse each response with content() inside read_xml(). Note that we are parsing the result element of what safely() returned:

my_resp <- map(my_req, function(z) {
  read_xml(content(z$result,
                   as = "text",
                   encoding = "UTF-8"))
})
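Because safely() returns a list with a result and an error element, a failed request has a NULL result, and parsing it would fail. A minimal sketch of separating failures from successes before parsing; fake_get and the URLs are hypothetical stand-ins for GET and the real request URLs, so the example runs offline:

```r
library(purrr)

# Hypothetical stand-in for httr::GET: fails for any URL containing "bad"
fake_get <- function(url) {
  if (grepl("bad", url)) stop("request failed")
  paste("response for", url)
}

get_safely <- safely(fake_get)

urls <- c("ok-1", "bad-2", "ok-3")
reqs <- map(urls, get_safely)

# A request failed if its error element is not NULL
failed <- map_lgl(reqs, function(r) !is.null(r$error))

good_results <- map(reqs[!failed], "result")  # safe to parse
failed_urls <- urls[failed]                   # candidates for a retry
```

Keeping failed_urls around makes it easy to re-run only the requests that did not go through.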

This can probably be cleaned up some, but it works. Coerce the AuthorList to a list and use a combination of map(), pluck() and unnest(). Note that a given author might have more than one affiliation, but I am only plucking the first one.

my_pm_list <- map(my_resp, function (z) {
  my_xml <- xml_child(xml_child(z, 1), 1)
  pmid <- xml_text(xml_find_first(my_xml, "//PMID"))
  authinfo <- as_list(xml_find_all(my_xml, ".//AuthorList"))
  return(list(pmid, authinfo))
})

# Pull the AuthorList (second element) out of each parsed article
myauthinfo <- map(my_pm_list, function(z) {
  auth <- z[[2]][[1]]
})

mytibble <- myauthinfo %>% {
  tibble(
    lastname = map_depth(., 2, pluck, "LastName", 1, .default = NA_character_),
    firstname = map_depth(., 2, pluck, "ForeName", 1, .default = NA_character_),
    affil = map_depth(., 2, pluck, "AffiliationInfo", "Affiliation", 1, .default = NA_character_)
  )
}

my_unnested_tibble <- mytibble %>%
  bind_cols(pmid = map_chr(my_pm_list, pluck, 1)) %>%
  unnest(c(lastname, firstname, affil))
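If you want every affiliation rather than only the first, one option is to collect all AffiliationInfo elements per author. The sketch below is an illustration, not part of the code above: the authors list is hand-built mock data shaped like the output of as_list() on an AuthorList node, and all_affils is a hypothetical helper.

```r
library(purrr)

# Mock AuthorList for one article (made-up names and affiliations)
authors <- list(
  Author = list(
    LastName = list("Smith"),
    ForeName = list("Jane"),
    AffiliationInfo = list(Affiliation = list("Dept A, City, USA")),
    AffiliationInfo = list(Affiliation = list("Dept B, City, UK"))
  ),
  Author = list(
    LastName = list("Lee"),
    ForeName = list("Min")
  )
)

# Hypothetical helper: all affiliations of one author, or NA if there are none
all_affils <- function(author) {
  infos <- author[names(author) == "AffiliationInfo"]
  affs <- map_chr(infos, pluck, "Affiliation", 1, .default = NA_character_)
  if (length(affs) == 0) NA_character_ else unname(affs)
}

# One row per author-affiliation pair; names are recycled across affiliations
author_rows <- do.call(rbind, map(unname(authors), function(a) {
  data.frame(
    lastname = pluck(a, "LastName", 1, .default = NA_character_),
    firstname = pluck(a, "ForeName", 1, .default = NA_character_),
    affil = all_affils(a),
    stringsAsFactors = FALSE
  )
}))
```

Here author_rows has one row per author-affiliation pair, and an author without any affiliation still gets a row with NA in affil.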
ciakovx
  • Hi, thanks for your fast reply! I just tried your suggestion, but for some reason it's not picking up my search string at all. I am not sure if it is too long, so I will continue adjusting the code/search string. My query also uses multiple 'OR' terms instead of 'AND'; potentially there are too many things in the search string? Thanks again! – Aber Sep 29 '20 at 14:09
  • Can you post your search string here? – ciakovx Sep 29 '20 at 16:50
  • Hi, I have been able to export the PMIDs from the search string instead, as an alternative way to figure this out. It's around 10,000 PMIDs, but the first few are 30361262 31203996 31141631 31028669 30420752 29601269 31628431 27959731 30499168, if that helps at all! I've been trying to play around with doing it this way instead, as it seems to be similar to the PubMed R examples seen online. – Aber Sep 30 '20 at 15:21
  • So what is your end goal? You want to have a data frame with pmid in one column and author affiliation in the other? – ciakovx Sep 30 '20 at 15:38
  • Yes, I have previously created a data frame with the columns PMID, Author name, Centre, Country, and State. I'd like to replicate something similar. – Aber Oct 02 '20 at 09:53