Extracting Affiliation data from a single PubMed record

Question

I have managed to extract affiliation data from a single PubMed record using easyPubMed and a lot of searching (I am still very new to R). The problem is that the result contains only one part of the affiliation information. I assume this is because the affiliation is a non-standardised string that mixes several kinds of information.

My code is as follows:

library(easyPubMed)

# PubMed query via easyPubMed using the URL of the XML
my_query <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=20301425&retmode=xml"
my_entrez_id <- get_pubmed_ids(my_query)
my_abstracts_txt <- fetch_pubmed_data(my_entrez_id, format = "abstract")
print(my_abstracts_txt[1:16])

my_abstracts_xml <- fetch_pubmed_data(my_entrez_id)
class(my_abstracts_xml)

# Extract the article titles from the XML as a character vector
my_titles <- custom_grep(my_abstracts_xml, "ArticleTitle", "char")
print(my_titles)


# easyPubMed: extracting affiliation data from a single PubMed record

# Convert XML PubMed records to strings using the articles_to_list() function
# Each record in the list is a string that still includes XML tags
my_PM_list <- articles_to_list(my_abstracts_xml)
class(my_PM_list[[4]])
cat(substr(my_PM_list[[4]], 1, 984))

# Affiliation can be extracted from a specific record using the custom_grep() function
# The fields extracted from the record will be returned as elements of a list or a character vector
curr_PM_record <- my_PM_list[[(length(my_PM_list) - 3)]]
Affiliation_Info <- custom_grep(curr_PM_record, tag = "AffiliationInfo")

View(Affiliation_Info)

Ideally, I would like to produce a data frame with the columns PMID, Author and Affiliation

(but for now I am just focusing on pulling out all the affiliation information from the PubMed URL).

I am really struggling to do so and would appreciate any help on this matter.

Thanks in advance!

Per `r` tag (hover or click to see): Use `dput()` for data and specify all non-base packages with `library()` calls. For reproducibility, please show us a sample of XML data or returned extraction of these package(?) function calls. We cannot see any of your `class`, `print`, `cat`, or `View` results. — Parfait, Aug 13 '20 at 19:50

score 0 · Accepted Answer · answered Aug 14 '20 at 09:55


Here is an xml2 approach...

library( xml2 )
library( magrittr )  # provides the %>% pipe

# read the XML data straight from the efetch URL
doc <- xml2::read_xml( "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=20301425&retmode=xml" )

# PMID of the record
pmid    <- xml2::xml_find_first( doc, ".//PMID") %>% xml2::xml_text()
# author names as "LastName, ForeName"
authors <- paste( 
  xml2::xml_find_all( doc, ".//AuthorList[@Type = 'authors']/Author/LastName") %>% xml2::xml_text(),
  xml2::xml_find_all( doc, ".//AuthorList[@Type = 'authors']/Author/ForeName") %>% xml2::xml_text(),
  sep = ", " )
# affiliation strings; these line up with authors only if every author has exactly one affiliation
affiliate <- xml2::xml_find_all( doc, ".//AuthorList[@Type = 'authors']/Author/AffiliationInfo/Affiliation") %>% xml2::xml_text()

# combine into a data frame; the single pmid is recycled across all rows
df <- data.frame( pmid = pmid, authors = authors, affiliate = affiliate )

This gives a data frame with one row per author and the columns pmid, authors and affiliate.
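
The code above collects authors and affiliations as two separate vectors, so the rows only line up when every author has exactly one affiliation. A minimal sketch of a per-author variant follows, in case some authors have none or several. It uses a small made-up XML fragment rather than the live efetch response: the names and affiliations are invented, and the fragment only assumes the element names already used in the XPath expressions above.

```r
library(xml2)

# Hypothetical fragment shaped like the record queried above; the author
# names and affiliations are invented for illustration only.
xml_txt <- '<PubmedBookArticle>
  <BookDocument>
    <PMID>20301425</PMID>
    <AuthorList Type="authors">
      <Author>
        <LastName>Doe</LastName><ForeName>Jane</ForeName>
        <AffiliationInfo><Affiliation>Example Institute, Tampa, Florida</Affiliation></AffiliationInfo>
        <AffiliationInfo><Affiliation>Sample University, Boston, Massachusetts</Affiliation></AffiliationInfo>
      </Author>
      <Author>
        <LastName>Roe</LastName><ForeName>Richard</ForeName>
      </Author>
    </AuthorList>
  </BookDocument>
</PubmedBookArticle>'

doc  <- read_xml(xml_txt)
pmid <- xml_text(xml_find_first(doc, ".//PMID"))
author_nodes <- xml_find_all(doc, ".//AuthorList[@Type = 'authors']/Author")

# Build one small data frame per author, then stack them.
rows <- lapply(author_nodes, function(a) {
  affs <- xml_text(xml_find_all(a, "./AffiliationInfo/Affiliation"))
  if (length(affs) == 0) affs <- NA_character_  # keep authors without an affiliation
  data.frame(
    pmid      = pmid,
    author    = paste(xml_text(xml_find_first(a, "./LastName")),
                      xml_text(xml_find_first(a, "./ForeName")),
                      sep = ", "),
    affiliate = affs,  # one row per affiliation; pmid and author are recycled
    stringsAsFactors = FALSE
  )
})
df <- do.call(rbind, rows)
```

Searching inside each Author node keeps every affiliation attached to its own author, at the cost of repeating the author on several rows when there is more than one affiliation.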

Wimpel

26,031
1
20
37

Thank you for this! This makes perfect sense, and using these programs has made it more clear to me. – Aber Aug 17 '20 at 08:43
Sorry, me again. Do you know if it's possible to then pull out the country/state from the affiliate into a fourth column? – Aber Aug 20 '20 at 13:53
then you would have to split the affiliate-string after the second comma... like this.. `gsub( ".*, ([a-zA-Z]+, [a-zA-Z]+$)", "\\1", "Moffitt Cancer Center, Tampa, Florida")` but there are many (probably smarter) ways to achieve the same result. – Wimpel Aug 20 '20 at 20:53
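
For the example string, the `gsub()` call in the comment above returns "Tampa, Florida", that is, the last two comma-separated pieces. If only the final piece is wanted as a fourth column, one possible base R sketch is shown below. The helper name `last_part` and the sample strings are made up for illustration, and the approach assumes the affiliation ends with the state or country, which will not hold for strings that end in an email address or a postal code.

```r
# Hypothetical helper: return the last comma-separated piece of each string.
last_part <- function(x) {
  pieces <- strsplit(x, ",\\s*")
  vapply(pieces, function(p) {
    if (length(p) == 0 || all(is.na(p))) return(NA_character_)
    trimws(sub("\\.$", "", p[length(p)]))  # drop a trailing full stop
  }, character(1))
}

# Invented sample data; in practice this would be the affiliate column.
affiliate <- c("Moffitt Cancer Center, Tampa, Florida",
               "Example Institute, Springfield, Illinois.",
               NA)
df <- data.frame(affiliate = affiliate, stringsAsFactors = FALSE)

# Fourth-column style result: the final piece of each affiliation string.
df$region <- last_part(df$affiliate)
```

Splitting on commas avoids writing a pattern for every possible affiliation layout, but the result still needs manual checking because the strings are not standardised.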
