0

I am attempting to convert this XML file from this USA government public data source into a clean relational database/ tabular format for the critical information. The problem I am running into is that there appears to be multiple field names (such as Last Name and Location) stored within the same node. What would be the best way to do some basic parsing of this file?

See this screenshot below from the XML guide where the information inside the orange box is what im struggling to pull apart. enter image description here

Ideally a result would look something like this:

LAST NAME FIRST NAME MID NAME
Doe John Bob
RyanG73
  • 63
  • 1
  • 4
  • Try looking at [XML - How to find specific attributes](https://stackoverflow.com/questions/71244164/xml-how-to-find-specific-attributes). – Alias Cartellano Jul 09 '22 at 16:15
  • 1
    What exactly are you struggling with? Ask either a Python question or an R question. Please post XML as text, not as an image. – mzjn Jul 09 '22 at 17:22

2 Answers2

3

If you are using R, it is straightforward to get these fields using the xml2 or rvest packages. For example, using the first xml file in the linked zip folder:

library(rvest)

entries <- read_html(path_to_xml) %>% 
  html_nodes(xpath = "//info")

result <- data.frame(Last_Name = entries %>% html_attr("lastnm"),
                     First_Name = entries %>% html_attr("firstnm"),
                     Mid_Name = entries %>% html_attr("midnm"))

head(result)
#>   Last_Name First_Name Mid_Name
#> 1    FISHER     ANDREW   MUNSON
#> 2 BACHARACH       ALAN   MARTIN
#> 3     GRAFF    MICHAEL  RAYMOND
#> 4      KAST    WILLIAM    ALLEN
#> 5   McMahan     Robert  Michael
#> 6   JOHNSON       JOHN        C

Created on 2022-07-09 by the reprex package (v2.0.1)

Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
  • 1
    thank you - this answers the question. My knowledge gap was the existence of html_attr and how attributes worked. I learned something new today – RyanG73 Jul 09 '22 at 21:23
3

Or with XML package:

library(XML)

doc <- xmlTreeParse(file='c:/RDev/test.xml',useInternalNodes = TRUE)
info <- xpathApply(doc,"//Info",function(x) xmlToList(x))

info[[1]]

[[1]]
           lastNm                                                  firstNm 
         "FISHER"                                                 "ANDREW" 
            midNm                                                  indvlPK 
         "MUNSON"                                                "2917271" 
        actvAGReg                                                     link 
              "N" "https://adviserinfo.sec.gov/individual/summary/2917271" 
Waldi
  • 39,242
  • 6
  • 30
  • 78