The following code parses XML to extract information such as node, parent, and type into a data frame. It works fine for a small XML file, but a file of more than 25,000 lines takes a couple of minutes to process, so I would like to optimize the code to run faster. The aim of the function is to read any XML file and generate the rows of the data frame shown below.

Sample XML:

<?xml version="1.0" encoding="UTF-8"?>
<CATALOG>
   <PLANT id="1" required="false">
      <COMMON Source="NLM">Bloodroot</COMMON>
      <BOTANICAL>Aquilegia canadensis</BOTANICAL>
      <DATE>
         <Year>2013</Year>
      </DATE>
   </PLANT>
   <PLANT id="2" required="true">
      <COMMON Source="LNP">Columbine</COMMON>
      <BOTANICAL>Aquilegia canadensis</BOTANICAL>
      <DATE>
         <Year>2014</Year>
      </DATE>
   </PLANT>
</CATALOG>

Output:

                      path      node                value  parent      type
1                  CATALOG   CATALOG                 NULL    NULL   element
2            CATALOG/PLANT     PLANT                 NULL CATALOG   element
3            CATALOG/PLANT        id                    1   PLANT attribute
4            CATALOG/PLANT  required                false   PLANT attribute
5     CATALOG/PLANT/COMMON    COMMON            Bloodroot   PLANT      text
6     CATALOG/PLANT/COMMON    Source                  NLM  COMMON attribute
7  CATALOG/PLANT/BOTANICAL BOTANICAL Aquilegia canadensis   PLANT      text
8       CATALOG/PLANT/DATE      DATE                 NULL   PLANT   element
9  CATALOG/PLANT/DATE/Year      Year                 2013    DATE      text
10           CATALOG/PLANT     PLANT                 NULL CATALOG   element
11           CATALOG/PLANT        id                    2   PLANT attribute
12           CATALOG/PLANT  required                 true   PLANT attribute
13    CATALOG/PLANT/COMMON    COMMON            Columbine   PLANT      text
14    CATALOG/PLANT/COMMON    Source                  LNP  COMMON attribute
15 CATALOG/PLANT/BOTANICAL BOTANICAL Aquilegia canadensis   PLANT      text
16      CATALOG/PLANT/DATE      DATE                 NULL   PLANT   element
17 CATALOG/PLANT/DATE/Year      Year                 2014    DATE      text

Code Snippet:

library(XML)
library(plyr)

## helper function of xPathApply
getValues <- function(x) {
  List <- list()

  # find all ancestors of a given node
  ancestorNames <- character()  
  ancestorNamesList <- xmlAncestors(x, fun = function(y) {
    ancestorNames <- c(ancestorNames, xmlName(y))})  
  pathName <- paste(ancestorNamesList, collapse = "/")

  # find the parent of a given node
  parentNode <- xmlParent(x)
  parentName <- "NULL"
  if(!is.null(parentNode)) {
    parentName <- xmlName(parentNode)
  } 

  if(inherits(x, "XMLInternalElementNode")) {
    # check if the value of the given node exists i.e. text
    if(length(xmlValue(x, recursive=FALSE)) != 0) {
      List <- append(List, list(path = pathName, node = xmlName(x), value = xmlValue(x, recursive=FALSE), parent = parentName, type = "text"))
    } else {
      List <- append(List, list(path = pathName, node = xmlName(x), value = "NULL", parent = parentName, type = "element"))      
    }
  }

  ## attributes
  if(!is.null(xmlAttrs(x))) {
    num.attributes = xmlSize(xmlAttrs(x))
    for (i in seq_len(num.attributes)) {
      # get the attribute name
      attributeName <- names(xmlAttrs(x)[i])
      # get the attribute value
      attributeValue <- xmlAttrs(x)[[i]]  

      List <- append(List, list(path = pathName, node = attributeName, value = attributeValue, parent = parentName, type = "attribute"))      
    }
  }

  return(List)
}

## recursive function 
visitNode <- function(node, xpath) {
  if (is.null(node)) {
    return()
  }

  # number of children of a node
  num.children <- xmlSize(node)

  bypass <- function(n = num.children) {
    if(num.children == 0) {
      xpathSApply(node, path = xpath, getValues)
    } else {
      return(num.children)
    }
  }

  # recursive call to visitNode 
  for (i in seq_len(num.children)) { 
    visitNode(node[[i]], xpath) 
  }   

  # add list type result to data frame
  if(is.list(result <- bypass())) {    
    dt <<- do.call(rbind.fill, lapply(result, data.frame)) 
  }
} 


# read XML data from the given file
xtree <- xmlParse("test.xml")

# retrieve the root of the XML
root <- xmlRoot(xtree)

# define data frame which is to hold the data interpreted from XML
dt <- data.frame(path = NA, node = NA, value = NA, parent = NA, type = NA)

# call to recursive function
visitNode(root, xpath <- "//node()")

dt
  • The main inefficiency I see is in not pre-dimensioning the `List` object. Using c() to extend lists can be very inefficient. Using `sapply` would not cure that pathology. See if `List <- list(xmlSize(xmlAttrs(x)) )` and just indexing List by `i` makes things move faster. – IRTFM Dec 17 '14 at 03:32
  • `List[[length(List)+1]]` is wrong. It should probably be `List[[i]]`. I tried this on a sample xml and it returns an empty list – Rich Scriven Dec 17 '14 at 03:34
  • Please, Richard, get people to pre-dimension. – IRTFM Dec 17 '14 at 03:35
  • Yes, but it's not easy to tell how many attributes the xml doc might have – Rich Scriven Dec 17 '14 at 03:35
  • I can't use List[[i]] because I am adding other stuff to the list like elements and text before the attributes loop. – user2877232 Dec 17 '14 at 03:38
  • I would use `xmlApply(x, ...)` where `x` is `xmlRoot(doc)` – Rich Scriven Dec 17 '14 at 03:40
  • I must say that I've been wanting to help with your xml questions, but they are so unclear that I get frustrated and quit. This one is pretty much the same because you don't show any example data and desired result. There must be some rules when working with xml because many nodes are totally different and so the result of this function might be something you don't want. If you could please add a bit more context to this question it would be awesome – Rich Scriven Dec 17 '14 at 04:29
  • @RichardScriven I have updated my post. Hope it's clear now. The retrieval of attributes part that I had posted earlier is a small part of this code which I felt wasn't efficient enough. Since I have all the code posted could you give some suggestions overall to increase efficiency? – user2877232 Dec 17 '14 at 06:43
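
A minimal sketch of the pre-dimensioning idea raised in the comments above, in base R only. The named character vector `attrs` is a stand-in for what `xmlAttrs(x)` returns, and the variable names `rows`, `byIndex`, and `vectorised` are made up for illustration. Note that `list(n)` creates a one-element list containing `n`; `vector("list", n)` is the call that allocates `n` empty slots.

```r
# Stand-in for xmlAttrs(x): a named character vector
# (names and values taken from the sample XML).
attrs <- c(id = "1", required = "false")

# Option 1: pre-size the list, then fill it by index instead of
# growing it with append() on every iteration.
rows <- vector("list", length(attrs))
for (i in seq_along(attrs)) {
  rows[[i]] <- data.frame(node = names(attrs)[i], value = attrs[[i]],
                          type = "attribute", stringsAsFactors = FALSE)
}
byIndex <- do.call(rbind, rows)

# Option 2: skip the loop and build all attribute rows in one call.
vectorised <- data.frame(node = names(attrs), value = unname(attrs),
                         type = "attribute", stringsAsFactors = FALSE)

print(vectorised)
```

Both options produce the same rows; the second avoids per-attribute work entirely, which may matter more than pre-sizing when most elements carry only a few attributes.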

1 Answer

I really wish there was good XSLT support in R, but I can't seem to find a great package for it. A different strategy would be to transform the XML into a simpler data file that you can easily read with read.table or something similar. You can parse it pretty easily with xmlEventParse. Here's a custom handler which seems to create the data you want:

library(XML)

getHandler <- function(file = "", sep = ",") {
    list(.startDocument = function(.state) {
           # write the header row; no append here, so an existing file is overwritten
           cat("path","node","value","parent","type", file=file, sep=sep)
           cat("\n", file=file, sep=sep, append=TRUE)
           .state
    }, .startElement=function(name, atts, .state) {
       # push the element name onto the current path
       .state$path <- c(.state$path, name)
       # one "element" row; the parent is the second-to-last path component
       cat(paste(.state$path, collapse="/"), name, NA, .state$path[length(.state$path)-1], "element", sep=sep, file=file, append=TRUE)
       cat("\n",  file=file, append=TRUE)
       if(!is.null(atts)) {
           # one "attribute" row per attribute, built in a single vectorised paste()
           cat(paste(paste(.state$path, collapse="/"), names(atts), atts, .state$path[length(.state$path)-1], "attribute", sep=sep, collapse="\n"), file=file, append=TRUE)
           cat("\n",file=file, append=TRUE)
       }
       .state
    }, .endElement=function(name, .state) {
       # pop the element name off the current path
       .state$path <- .state$path[-length(.state$path)]
       .state
    }, .text=function(value, .state) {
       # trim surrounding whitespace and skip whitespace-only text nodes
       value <- gsub("^\\s+|\\s+$", "", value)
       if(nchar(value)>0) {
           cat(paste(.state$path, collapse="/"), .state$path[length(.state$path)], value, .state$path[length(.state$path)-1], "text", sep=sep, file=file, append=TRUE)
           cat("\n", file=file, append=TRUE)
       }
       .state
    })
}

So it's not exactly pretty but it's basically just building a string with cat(). We can then use it with

zz <- xmlEventParse("test.xml",
    handlers = getHandler(), 
    state = list(path=character(0)), useDotNames=TRUE)

This will output the data as comma-separated values to the screen. To save to a file, you can do

zz <- xmlEventParse("test.xml",
    handlers = getHandler(file="ok.txt", sep="\t"), 
    state = list(path=character(0)), useDotNames=TRUE)

which will write the data as delimited to a file named "ok.txt". You can then read the data in with

read.table("ok.txt", sep="\t", header=TRUE)

which returns

                      path      node                value  parent      type
1                  CATALOG   CATALOG                 <NA>           element
2            CATALOG/PLANT     PLANT                 <NA> CATALOG   element
3            CATALOG/PLANT        id                    1 CATALOG attribute
4            CATALOG/PLANT  required                false CATALOG attribute
5     CATALOG/PLANT/COMMON    COMMON                 <NA>   PLANT   element
6     CATALOG/PLANT/COMMON    Source                  NLM   PLANT attribute
7     CATALOG/PLANT/COMMON    COMMON            Bloodroot   PLANT      text
8  CATALOG/PLANT/BOTANICAL BOTANICAL                 <NA>   PLANT   element
9  CATALOG/PLANT/BOTANICAL BOTANICAL Aquilegia canadensis   PLANT      text
10      CATALOG/PLANT/DATE      DATE                 <NA>   PLANT   element
11 CATALOG/PLANT/DATE/Year      Year                 <NA>    DATE   element
12 CATALOG/PLANT/DATE/Year      Year                 2013    DATE      text
13           CATALOG/PLANT     PLANT                 <NA> CATALOG   element
14           CATALOG/PLANT        id                    2 CATALOG attribute
15           CATALOG/PLANT  required                 true CATALOG attribute
16    CATALOG/PLANT/COMMON    COMMON                 <NA>   PLANT   element
17    CATALOG/PLANT/COMMON    Source                  LNP   PLANT attribute
18    CATALOG/PLANT/COMMON    COMMON            Columbine   PLANT      text
19 CATALOG/PLANT/BOTANICAL BOTANICAL                 <NA>   PLANT   element
20 CATALOG/PLANT/BOTANICAL BOTANICAL Aquilegia canadensis   PLANT      text
21      CATALOG/PLANT/DATE      DATE                 <NA>   PLANT   element
22 CATALOG/PLANT/DATE/Year      Year                 <NA>    DATE   element
23 CATALOG/PLANT/DATE/Year      Year                 2014    DATE      text

Now there are more rows than you had in your sample, but some of the selection rules weren't that clear to me.
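
If the extra rows are unwanted, one possible post-processing step is sketched below in base R. It rests on guesses about the selection rules: each element is assumed to have at most one text chunk, an element with text is reported as a single "text" row, and an attribute's parent is taken to be the element that owns it. The function name `collapseTextRows` and the hand-typed `parsed` data frame (the first PLANT from the output above) are illustrative only.

```r
# First PLANT from the read.table() output above, typed in by hand so the
# sketch runs without the XML package.
parsed <- data.frame(
  path = c("CATALOG", rep("CATALOG/PLANT", 3), rep("CATALOG/PLANT/COMMON", 3),
           rep("CATALOG/PLANT/BOTANICAL", 2), "CATALOG/PLANT/DATE",
           rep("CATALOG/PLANT/DATE/Year", 2)),
  node = c("CATALOG", "PLANT", "id", "required", "COMMON", "Source", "COMMON",
           "BOTANICAL", "BOTANICAL", "DATE", "Year", "Year"),
  value = c(NA, NA, "1", "false", NA, "NLM", "Bloodroot",
            NA, "Aquilegia canadensis", NA, NA, "2013"),
  parent = c("", "CATALOG", "CATALOG", "CATALOG", "PLANT", "PLANT", "PLANT",
             "PLANT", "PLANT", "PLANT", "DATE", "DATE"),
  type = c("element", "element", "attribute", "attribute", "element",
           "attribute", "text", "element", "text", "element", "element", "text"),
  stringsAsFactors = FALSE
)

collapseTextRows <- function(df) {
  idx <- seq_len(nrow(df))
  # For every row, the index of the most recent "element" row with the
  # same path (0 if there is none). Rows are assumed to be in document order.
  elementIdx <- ifelse(df$type == "element", idx, 0L)
  owner <- ave(elementIdx, df$path, FUN = cummax)

  # Move each text value onto its owning element row, then drop the text row.
  textRows <- which(df$type == "text" & owner > 0)
  df$value[owner[textRows]] <- df$value[textRows]
  df$type[owner[textRows]] <- "text"
  if (length(textRows) > 0) df <- df[-textRows, , drop = FALSE]

  # Report the owning element (last path component) as an attribute's parent.
  isAttr <- df$type == "attribute"
  df$parent[isAttr] <- sub(".*/", "", df$path[isAttr])

  rownames(df) <- NULL
  df
}

result <- collapseTextRows(parsed)
print(result)
```

The columns need to be character vectors rather than factors for the assignments to work, so pass `stringsAsFactors = FALSE` to read.table if your R version defaults to factors. Everything is vectorised, so the step should stay cheap on large files, but it has not been timed here.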

The main idea is that xmlEventParse is more efficient than xmlParse because it doesn't have to build the entire tree in memory. Additionally, by using cat() to dump to a file, I don't have to worry about memory management right away (though writing to disk isn't exactly fast either).

Anyway, it's at least another option to consider.

MrFlick