I'm dealing with huge XML files, e.g. 75 GB and more, so any small per-record overhead turns into minutes, if not hours, of extra processing time.

The core of my code does the following while parsing an XML chunk. Say I have a chunk of 3 lines. Note that I only care about the a, b and c attributes, but there may be item elements with some of them missing, e.g.

xmlvec <- c('<item a="1" c="2" x="very long whatever" />',
            '<item b="3" c="4" x="very long whatever" />',
            '<item a="5" b="6" c="7" x="very long whatever" />')

I define a mapping listing which attributes to look up and, optionally, what to rename them to; these are the only ones I'd like to read:

mapping <- c("a", "b", "c")
# this doesn't matter here
#names(mapping) <- c("aa", "bb", "cc") 

If I do the following I get missing values and/or NA column names, because the missing attributes affect how the rows are bound. Note the missing b column, since the first item element doesn't have it:

library(xml2)

df <- as.data.frame(do.call(
  rbind,
  lapply(xml_children(read_xml(paste("<xml>", paste(xmlvec, collapse=""), "</xml>"))),
         function(x) {
           xml_attrs(x)[mapping]
         }
  )
), stringsAsFactors = FALSE)
df
     a   NA c
1    1 <NA> 2
2 <NA>    3 4
3    5    6 7

Since attribute b is missing in the first row of this mini chunk, I get an NA column which I can't later match to any column name. Because the first line of any chunk is arbitrary and may be missing any subset of the attributes, I need to enforce the schema while reading each element so that the enclosing data frame doesn't break, but this is very expensive performance-wise:

df <- as.data.frame(do.call(
  rbind,
  lapply(xml_children(read_xml(paste("<xml>", paste(xmlvec, collapse=""), "</xml>"))),
         function(x) {
           y <- xml_attrs(x)[mapping]
           if (any(is.na(names(y)))) {
             y <- y[-which(is.na(names(y)))]
           }
           y[setdiff(mapping, names(y))] <- NA
           y[order(factor(names(y), levels=mapping))]
         }
  )
), stringsAsFactors = FALSE)
df
     a    b c
1    1 <NA> 2
2 <NA>    3 4
3    5    6 7

Now the column schema and order are enforced, but at a very high performance penalty, since this is done on a per-line basis. Is there a better way?
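For reference, one direction I've considered is querying column-wise instead of row-wise; if I understand the xml2 docs correctly, xml_attr() is vectorised over a nodeset and returns NA wherever an attribute is missing, which would enforce the schema for free (untested at scale):

```r
library(xml2)

# xmlvec and mapping as defined above
xmlvec <- c('<item a="1" c="2" x="very long whatever" />',
            '<item b="3" c="4" x="very long whatever" />',
            '<item a="5" b="6" c="7" x="very long whatever" />')
mapping <- c("a", "b", "c")

# Parse the chunk once and grab the item nodes
doc   <- read_xml(paste0("<xml>", paste(xmlvec, collapse = ""), "</xml>"))
nodes <- xml_children(doc)

# One vectorised xml_attr() call per wanted attribute;
# missing attributes come back as NA automatically
df <- as.data.frame(
  lapply(setNames(mapping, mapping), function(a) xml_attr(nodes, a)),
  stringsAsFactors = FALSE
)
df
#      a    b c
# 1    1 <NA> 2
# 2 <NA>    3 4
# 3    5    6 7
```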

SkyWalker

2 Answers

This solution works on the example provided. If your XML is more complicated, with nested elements, you may need an alternative approach.

The basic idea is to convert each element of xmlvec into a data.frame, then use data.table::rbindlist() to bind the data.frames together. rbindlist() also takes care of filling in missing columns if you use fill = TRUE and use.names = TRUE.

library(XML)
library(data.table)
library(magrittr)

xmlvec <- c('<item a="1" c="2" x="very long whatever" />',
            '<item b="3" c="4" x="very long whatever" />',
            '<item a="5" b="6" c="7" x="very long whatever" />')

l <- lapply( xmlvec, function(x) {
  XML::xmlToList( x ) %>%
    t() %>%
    as.data.frame()
})

dt <- data.table::rbindlist(l, use.names = TRUE, fill = TRUE)
dt
#       a c                  x    b
# 1:    1 2 very long whatever <NA>
# 2: <NA> 4 very long whatever    3
# 3:    5 7 very long whatever    6

## Now you can subset the columns of interest if you so wish
mapping <- c("a","b","c")
dt[, ..mapping]
#       a    b c
# 1:    1 <NA> 2
# 2: <NA>    3 4
# 3:    5    6 7

If you prefer a tidyverse approach, I think dplyr::bind_rows() provides similar functionality to rbindlist().
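As a sketch of that tidyverse variant (untested against large files; bind_rows() likewise fills columns absent from some data.frames with NA):

```r
library(XML)
library(dplyr)

# Same inputs as above
xmlvec <- c('<item a="1" c="2" x="very long whatever" />',
            '<item b="3" c="4" x="very long whatever" />',
            '<item a="5" b="6" c="7" x="very long whatever" />')
mapping <- c("a", "b", "c")

# One single-row data.frame per item, as before
l <- lapply(xmlvec, function(x) {
  as.data.frame(t(XML::xmlToList(x)), stringsAsFactors = FALSE)
})

# bind_rows() matches by name and fills missing columns with NA
df <- dplyr::bind_rows(l)
df[, mapping]
#      a    b c
# 1    1 <NA> 2
# 2 <NA>    3 4
# 3    5    6 7
```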


To select down to the columns of interest before binding, you can drop the others inside the lapply:

l <- lapply( xmlvec, function(x) {
  XML::xmlToList( x ) %>%
    t() %>%
    as.data.frame() %>%
    dplyr::select( intersect(names(.), mapping ) )
})

dt <- data.table::rbindlist(l, use.names = TRUE, fill = TRUE)
dt
#       a c    b
# 1:    1 2 <NA>
# 2: <NA> 4    3
# 3:    5 7    6
SymbolixAU
  • Thank you! In your solution the columns are excluded after loading; I need to exclude them before (they are huge) or I will run out of memory, plus there's the inefficiency of fetching/binding/processing those columns only to exclude them later. – SkyWalker Dec 16 '19 at 02:18
  • @SkyWalker - I've updated; is this what you're after? – SymbolixAU Dec 16 '19 at 02:24
  • thank you! I'll test it tomorrow ... going to bed now :) – SkyWalker Dec 16 '19 at 02:28

A completely different approach, but likely to be the most efficient*, is to write your own XML parser using Rcpp and rapidxmlr.

This will work on a single .xml document (which, judging by all your other recent questions, is what you're working with).

So we can take your small example and make it into a single xml document

xml <- '<?xml version="1.0" encoding="UTF-8"?>
<items>
<item a="1" c="2" x="very long whatever" />
<item b="3" c="4" x="very long whatever" />
<item a="5" b="6" c="7" x="very long whatever" />
</items>'

writeLines(xml, "~/Desktop/xml_test/xml.xml")

Then we can write some C++ code through Rcpp, using the rapidxmlr library.

I'm also using the development library gpx because I want its node_size() function (I'm the author of the gpx library as well).

For this trivial example, where you know the attributes a, b and c in advance, we can simply define the output vectors at the top and fill them whenever the attribute exists.

library(rapidxmlr)
library(Rcpp)
# devtools::install_github("dcooley/gpx")
library(gpx)

cppFunction(
  depends = c("rapidxmlr", "gpx")
  , include = c(
    '#include "rapidxml.hpp"'
    , '#include "gpx/utils.hpp"'
    , '#include <fstream>' 
  )
  , code = "
    Rcpp::DataFrame df_from_xml( std::string xml_file ) {

      // parse document
      rapidxml::xml_document<> doc;
      std::ifstream theFile( xml_file );

      std::vector< char > buffer(
        (std::istreambuf_iterator<char>(theFile)),
        std::istreambuf_iterator<char>()
      );
      buffer.push_back('\\0');
      doc.parse<0>( &buffer[0] );

      // get root node
      rapidxml::xml_node<> *root_node = doc.first_node(\"items\");

      size_t n = gpx::utils::node_size( root_node, \"item\");

      // initialise vectors to store the results
      Rcpp::StringVector a( n, Rcpp::StringVector::get_na());
      Rcpp::StringVector b( n, Rcpp::StringVector::get_na());
      Rcpp::StringVector c( n, Rcpp::StringVector::get_na());

      R_xlen_t counter = 0;

      for(
        rapidxml::xml_node<> *item_node = root_node->first_node(\"item\");
        item_node;
        item_node = item_node -> next_sibling()
      ) {

        // get attributes of the node
        if( item_node -> first_attribute(\"a\") ) {
          a[ counter ] = item_node -> first_attribute(\"a\") -> value();
        }
        if( item_node -> first_attribute(\"b\") ) {
          b[ counter ] = item_node -> first_attribute(\"b\") -> value();
        }
        if( item_node -> first_attribute(\"c\") ) {
          c[ counter ] = item_node -> first_attribute(\"c\") -> value();
        }

        counter++;
      }

      return Rcpp::DataFrame::create(
        _[\"a\"] = a,
        _[\"b\"] = b,
        _[\"c\"] = c
      );

    }
  "
)

f <- normalizePath("~/Desktop/xml_test/xml.xml")
df_from_xml( f )
#      a    b c
# 1    1 <NA> 2
# 2 <NA>    3 4
# 3    5    6 7

*efficiency in terms of code speed, not learning how to write Rcpp / C++ code.

SymbolixAU