How do I pull specific fields from this XML file using R or Python?

Question

I am attempting to convert this XML file from this USA government public data source into a clean relational database/ tabular format for the critical information. The problem I am running into is that there appears to be multiple field names (such as Last Name and Location) stored within the same node. What would be the best way to do some basic parsing of this file?

See this screenshot below from the XML guide where the information inside the orange box is what im struggling to pull apart.

Ideally a result would look something like this:

LAST NAME	FIRST NAME	MID NAME
Doe	John	Bob

Try looking at [XML - How to find specific attributes](https://stackoverflow.com/questions/71244164/xml-how-to-find-specific-attributes). — Alias Cartellano, Jul 09 '22 at 16:15
What exactly are you struggling with? Ask either a Python question or an R question. Please post XML as text, not as an image. — mzjn, Jul 09 '22 at 17:22

score 3 · Accepted Answer · answered Jul 09 '22 at 17:00

If you are using R, it is straightforward to get these fields using the xml2 or rvest packages. For example, using the first xml file in the linked zip folder:

library(rvest)

entries <- read_html(path_to_xml) %>% 
  html_nodes(xpath = "//info")

result <- data.frame(Last_Name = entries %>% html_attr("lastnm"),
                     First_Name = entries %>% html_attr("firstnm"),
                     Mid_Name = entries %>% html_attr("midnm"))

head(result)
#>   Last_Name First_Name Mid_Name
#> 1    FISHER     ANDREW   MUNSON
#> 2 BACHARACH       ALAN   MARTIN
#> 3     GRAFF    MICHAEL  RAYMOND
#> 4      KAST    WILLIAM    ALLEN
#> 5   McMahan     Robert  Michael
#> 6   JOHNSON       JOHN        C

^{Created on 2022-07-09 by the reprex package (v2.0.1)}

thank you - this answers the question. My knowledge gap was the existence of html_attr and how attributes worked. I learned something new today — RyanG73, Jul 09 '22 at 21:23

Waldi · Answer 2 · 2022-07-09T18:23:47.677

Or with XML package:

library(XML)

doc <- xmlTreeParse(file='c:/RDev/test.xml',useInternalNodes = TRUE)
info <- xpathApply(doc,"//Info",function(x) xmlToList(x))

info[[1]]

[[1]]
           lastNm                                                  firstNm 
         "FISHER"                                                 "ANDREW" 
            midNm                                                  indvlPK 
         "MUNSON"                                                "2917271" 
        actvAGReg                                                     link 
              "N" "https://adviserinfo.sec.gov/individual/summary/2917271"

How do I pull specific fields from this XML file using R or Python?

2 Answers2