Parsing large and complicated XML file to data.frame

Question

So, I have large XML file with lots of reports. I created data example below to approximately show the size of xml and its structure:

x <- "<Report><Agreements><AgreementList /></Agreements><CIP><RecordList><Record><Date>2017-05-26T00:00:00</Date><Grade>2</Grade><ReasonsList><Reason><Code>R</Code><Description>local</Description></Reason></ReasonsList><Score>xxx</Score></Record><Record><Date>2017-04-30T00:00:00</Date><Grade>2</Grade><ReasonsList><Reason><Code>R</Code><Description/></Reason></ReasonsList><Score>xyx</Score></Record></RecordList></CIP><Individual><Contact><Email/></Contact><General><FirstName>MM</FirstName></General></Individual><Inquiries><InquiryList><Inquiry><DateOfInquiry>2017-03-19</DateOfInquiry><Reason>cc</Reason></Inquiry><Inquiry><DateOfInquiry>2016-10-14</DateOfInquiry><Reason>er</Reason></Inquiry></InquiryList><Summary><NumberOfInquiries>2</NumberOfInquiries></Summary></Inquiries></Report>"

x <- paste(rep(x, 1.5e+5), collapse = "")
x <- paste0("<R>", x, "</R>")
require(XML)
p <- xmlParse(x)
p <- xmlRoot(p)
p[[1]]

I would like to transform this data to data.frame, but the structure of XML isn't straightforward. Previously working with XMLs I created loop that for every report transforms its sub nodes to data.frame, but here (in this data) the sub node count is greater than 30 (didn't put all of them in the example), and the structure differs (List nodes can occur even 2 levels deep in XML).

So I have few questions:

1) I am sure that looping over reports isn't the best way to handle this. How should I approach this problem?

2) Can I somehow extract all the data of one report two one line of data.frame (recursively maybe)?

3) Or can I automatically create separate data.frames for each list object of XML?

Any help would be much appreciated.

Update:

Example of results could look like this:

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   1 obs. of  17 variables:
 $ Record.1.Date                : chr "2017-05-26T00:00:00"
 $ Record.1.Grade               : num 2
 $ Record.1.Reason.1.Code       : chr "R"
 $ Record.1.Reason.1.Description: chr "local"
 $ Record.1.Score               : chr "xxx"
 $ Record.2.Date                : chr "2017-05-26T00:00:00"
 $ Record.2.Grade               : num 2
 $ Record.2.Reason.1.Code       : chr "R"
 $ Record.2.Reason.1.Description: chr "NA"
 $ Record.2.Score               : chr "xyx"
 $ Email.1                      : chr "NA"
 $ FirstName                    : chr "MM"
 $ Inquiry.1.DateOfInquiry      : POSIXct, format: "2017-03-19"
 $ Inquiry.1.Reason             : chr "cc"
 $ Inquiry.2.DateOfInquiry      : POSIXct, format: "2016-10-14"
 $ Inquiry.2.Reason             : chr "er"
 $ NumberOfInquiries            : num 2

, but as I mentioned previously, sub lists could also be in separate tables.

Please try and ask one question per post, multiple questions often implies that your question is a bit too broad. Especially the memory leak issue should have it's own question, with very concrete example code that shows the memory leak. — Paul Hiemstra, May 30 '17 at 09:07
@PaulHiemstra Agree, removed the last question. The rest of questions, in my opinion, are one: What is the best way to accomplish my task? — minem, May 30 '17 at 09:13
@AndrewGustar `xmlToDataFrame(p)` does not work in this case (and many other), when xml is with complicated structure. It concatenates all of sub nodes — minem, May 30 '17 at 10:13
Yes, although I thought it might be easier to separate them after converting to a df, rather than directly from the XML. — Andrew Gustar, May 30 '17 at 10:16
@AndrewGustar In my posted example, maybe, it looks like it could be easier, but in real data, the structure is more complex and some sub nodes does not have values or they contains lists. Using `xmlToDataFrame` i would get unstructured character strings, which would not be easily separable (if even possible) — minem, May 30 '17 at 10:26
Nasty. I usually end up just treating these as text files and ploughing through them the hard way! Hopefully somebody here will have more finesse. — Andrew Gustar, May 30 '17 at 10:32
Can you please give an example of the desired output row corresponding to one report. — 9Heads, Jun 02 '17 at 06:34

antonio · Accepted Answer · 2017-06-06T08:57:00.197

L=xmlToList(x)
str(data.frame(t(unlist(L)), stringsAsFactors=FALSE))
# 'data.frame': 1 obs. of  15 variables:
#  $ CIP.RecordList.Record.Date                          : chr "2017-05-26T00:00:00"
#  $ CIP.RecordList.Record.Grade                         : chr "2"
#  $ CIP.RecordList.Record.ReasonsList.Reason.Code       : chr "R"
#  $ CIP.RecordList.Record.ReasonsList.Reason.Description: chr "local"
#  $ CIP.RecordList.Record.Score                         : chr "xxx"
#  $ CIP.RecordList.Record.Date.1                        : chr "2017-04-30T00:00:00"
#  $ CIP.RecordList.Record.Grade.1                       : chr "2"
#  $ CIP.RecordList.Record.ReasonsList.Reason.Code.1     : chr "R"
#  $ CIP.RecordList.Record.Score.1                       : chr "xyx"
#  $ Individual.General.FirstName                        : chr "MM"
#  $ Inquiries.InquiryList.Inquiry.DateOfInquiry         : chr "2017-03-19"
#  $ Inquiries.InquiryList.Inquiry.Reason                : chr "cc"
#  $ Inquiries.InquiryList.Inquiry.DateOfInquiry.1       : chr "2016-10-14"
#  $ Inquiries.InquiryList.Inquiry.Reason.1              : chr "er"
#  $ Inquiries.Summary.NumberOfInquiries                 : chr "2"

If you want to convert strings that have a suitable representation as numbers, assuming that df is the data frame above:

data.frame(t(lapply(df, function(x) 
               ifelse(is.na(y<-suppressWarnings(as.numeric(x))), x, y))))

Strings that do not have a number representation will not be converted.

Update

Motivation

A) In some comments the OP added a further request for execution speed, which is normally not a issue for one time tasks such as data import. The solution above is based on recursion, as explicitly required in the question. Of course, traversing up and down the nodes adds a lot of overhead.

B) One recent answer here proposes a complex method based on a collection of external tools. There might of course be different nice utilities to manage XML files, but IMHO much of the XPATH work can be comfortably and efficiently done in R itself.

C) The OP wonders if it is possible to "create separate data.frames for each list object of XML".

D) I noticed that in the question tags, the OP (seems to) require the newer xml2 package.

I address the points above using XPATH straight from R.

XPATH approach

Below I extract in a separate data frame the Record node. One can use the same approach for other (sub)nodes too.

library(xml2)
xx=read_xml(x)                                                                              
xx=(xml_find_all(xx, "//Record"))
system.time(
    xx <- xml_find_all(xx, ".//descendant::*[not(*)]"))
#  user  system elapsed 
# 38.00    0.36   38.35 
system.time(xx <- xml_text(xx))
#  user  system elapsed 
# 68.39    0.05   68.53 
head(data.frame(t(matrix(xx, 5))))
#                    X1 X2 X3    X4  X5
# 1 2017-05-26T00:00:00  2  R local xxx
# 2 2017-04-30T00:00:00  2  R       xyx
# 3 2017-05-26T00:00:00  2  R local xxx
# 4 2017-04-30T00:00:00  2  R       xyx
# 5 2017-05-26T00:00:00  2  R local xxx
# 6 2017-04-30T00:00:00  2  R       xyx

(You might want to add further code to name data frame columns)

Time is referred to my average laptop.

Explanations

The core of the solutions lies in the XPATH .//descendant::*[not(*)]. .//descendant:: extracts all descendants of the current context (the Record node); adding [not(*)] further flattens the layout. This allows to linearize a tree structure, making it more for suitable for data science modeling.

The flexibility of * comes at a price in terms of computation. However, the computational burden does no lie on R, which is an interpreted language, but comes at the expenses of the highly efficient external C library libxml2. Results should be equal or better than those of other utilities and libraries.

Did you try and run this on my sample data? It contains 150000 reports, and it took for me approximately 18 minutes to run `xmlToList`. As my real data is more complex and bigger, I do not what to think about timing of transforming it. I think that there have to be another way to accomplish my task. — minem, Jun 02 '17 at 18:56
@MārtiņšMiglinieks: Please, state exactly in your question what solution you are looking for and what you are *not* looking for. If you write: "the structure of XML isn't straightforward. Previously working with XMLs I created loop" "I am sure that looping over reports isn't the best way", that clearly means that you want a working solution based on the XML package without for-loops. And this such a solution. Also note that answers can only be based on the actual data you posted. — antonio, Jun 02 '17 at 20:19
Maybe you did not notice, but in my question I created sample data, that approximately shows how big my data is. (line: `x <- paste(rep(x, 1.5e+5), collapse = ""`) I am not looking for answer based on specific packages, but it should be somewhat generalized, and speed should be one of main concerns. — minem, Jun 03 '17 at 12:00
@MārtiņšMiglinieks: I noticed: code is based on your sample `x <- " ...` . 20 mins to import a huge XML data set are totally acceptable for most scenarios, particularly when data import is done just once. If you have specific time constraints, please, add them to your question specifying what is an _acceptable time span_, so we might check if it is practically achievable. — antonio, Jun 03 '17 at 13:02

score 3 · Answer 2 · answered Jun 04 '17 at 03:00

Because you mention, I would like to transform this data, consider XSLT, the special-purpose transformation language designed to restructure complex XML to various end-use structures. And in your case flattening any text holding nodes in XML which then can easily be imported with xmlToDataFrame(). While below uses xsltproc and .NET Xsl class, any external processor or language module (e.g., Python, Java, C#, VB, PHP) that supports XSLT 1.0 can work:

XSLT (save as .xsl file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="/R">  
        <xsl:copy> 
            <xsl:apply-templates select="Report"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="Report">
        <xsl:copy> 
            <xsl:apply-templates select="descendant::*[string-length(text())>0]"/>
        </xsl:copy>
    </xsl:template>

   <xsl:template match="*">  
      <xsl:element name="{concat(local-name(), position())}">
      <xsl:value-of select="." />
      </xsl:element>
   </xsl:template>

</xsl:stylesheet>

XML Output (with numbered suffixes to avoid repeated column error)

<?xml version="1.0"?>
<R>
  <Report>
    <Date1>2017-05-26T00:00:00</Date1>
    <Grade2>2</Grade2>
    <Code3>R</Code3>
    <Description4>local</Description4>
    <Score5>xxx</Score5>
    <Date6>2017-04-30T00:00:00</Date6>
    <Grade7>2</Grade7>
    <Code8>R</Code8>
    <Score9>xyx</Score9>
    <FirstName10>MM</FirstName10>
    <DateOfInquiry11>2017-03-19</DateOfInquiry11>
    <Reason12>cc</Reason12>
    <DateOfInquiry13>2016-10-14</DateOfInquiry13>
    <Reason14>er</Reason14>
    <NumberOfInquiries15>2</NumberOfInquiries15>
  </Report>
</R>

R Mac/Linux Script (calling xsltproc, available package on unix machines)

library(XML)

setwd("/path/to/working/folder")

# COMMAND LINE CALL (INSTALL xsltproc IN TERMINAL)
system(paste("cd", getwd(), " && xsltproc -o Output.xml XSLTScript.xsl Input.xml"))

# PARSE AND LOAD TO DF
doc <- xmlParse('Output.xml')
df <- xmlToDataFrame(nodes = getNodeSet(doc, "//Report"))

str(df)
# 'data.frame': 6 obs. of  15 variables:
#  $ Date1              : chr  "2017-05-26T00:00:00" "2017-05-26T00:00:00" "2017-05-26T00:00:00" "2017-05-26T00:00:00" ...
#  $ Grade2             : chr  "2" "2" "2" "2" ...
#  $ Code3              : chr  "R" "R" "R" "R" ...
#  $ Description4       : chr  "local" "local" "local" "local" ...
#  $ Score5             : chr  "xxx" "xxx" "xxx" "xxx" ...
#  $ Date6              : chr  "2017-04-30T00:00:00" "2017-04-30T00:00:00" "2017-04-30T00:00:00" "2017-04-30T00:00:00" ...
#  $ Grade7             : chr  "2" "2" "2" "2" ...
#  $ Code8              : chr  "R" "R" "R" "R" ...
#  $ Score9             : chr  "xyx" "xyx" "xyx" "xyx" ...
#  $ FirstName10        : chr  "MM" "MM" "MM" "MM" ...
#  $ DateOfInquiry11    : chr  "2017-03-19" "2017-03-19" "2017-03-19" "2017-03-19" ...
#  $ Reason12           : chr  "cc" "cc" "cc" "cc" ...
#  $ DateOfInquiry13    : chr  "2016-10-14" "2016-10-14" "2016-10-14" "2016-10-14" ...
#  $ Reason14           : chr  "er" "er" "er" "er" ...
#  $ NumberOfInquiries15: chr  "2" "2" "2" "2" ...

R Windows (using Powershell xsl script calling .NET Xsl class, see here)

library(XML)

# COMMAND LINE CALL (NO INSTALLS NEEDED)
system(paste0('Powershell.exe -File',
              ' "C:\\Path\\To\\PowerShell\\Script.ps1"',
              ' "C:\\Path\\To\\Input.xml"',
              ' "C:\\Path\\To\\XSLT\\Script.xsl"', 
              ' "C:\\Path\\To\\Output.xml"'))

# PARSE AND LOAD TO DF
doc <- xmlParse('Output.xml')
df <- xmlToDataFrame(nodes = getNodeSet(doc, "//Report"))

Sorry, this is different from what asked in the example of results. The OP requires missing values too (see email field). What are execution speed improvements when compared with the whole R approach? — antonio, Jun 05 '17 at 16:04

Parsing large and complicated XML file to data.frame

Update:

2 Answers2

Update

Linked