I have a large number of XML files that I need to process. To that end I want to read the files and save the resulting list of objects to disk. I tried to save the list with readr::write_rds, but after reading it back in, the objects are modified and no longer valid. Is there anything I can do to alleviate this problem?

library(readr)
library(xml2)

x <- read_xml("<foo>
              <bar>text <baz id = 'a' /></bar>
              <bar>2</bar>
              <baz id = 'b' />
              </foo>")

# function to save and read object
roundtrip <- function(obj) {
  tf <- tempfile()
  on.exit(unlink(tf))

  write_rds(obj, tf)
  read_rds(tf)
}

list(x)
#> [[1]]
#> {xml_document}
#> <foo>
#> [1] <bar>text <baz id="a"/></bar>
#> [2] <bar>2</bar>
#> [3] <baz id="b"/>
roundtrip(list(x))
#> [[1]]
#> {xml_document}

identical(x, roundtrip(x))
#> [1] FALSE
all.equal(x, roundtrip(x))
#> [1] TRUE
xml_children(roundtrip(x))
#> Error in fun(x$node, ...): external pointer is not valid
as_list(roundtrip(x))
#> Error in fun(x$node, ...): external pointer is not valid

Some context

I have around 500,000 XML files. To process them I planned to turn each one into a list with xml2::as_list, and I wrote code to extract what I need from those lists. Afterwards I realized that as_list is very expensive to run. I could either:

  1. re-write already carefully debugged code to parse data directly (xml_child, xml_text, ...), or
  2. use as_list.

To speed up option 2 I could run it on another machine with more cores, but I would like to pass a single file to that machine, because collecting and copying all the individual files is time-consuming.

Thomas K

1 Answer

xml2 objects have external pointers that become invalid when you serialize them naively. The package provides the functions xml_serialize() and xml_unserialize() to handle this for you. Unfortunately the API is slightly cumbersome because base::serialize() and base::unserialize() assume an open connection.


library(xml2)

x <- read_xml("<foo>
              <bar>text <baz id = 'a' /></bar>
              <bar>2</bar>
              <baz id = 'b' />
              </foo>")

# function to save and read object
roundtrip <- function(obj) {
  tf <- tempfile()
  con <- file(tf, "wb")
  on.exit(unlink(tf))

  xml_serialize(obj, con)
  close(con)
  con <- file(tf, "rb")
  on.exit(close(con), add = TRUE)
  xml_unserialize(con)
}
x
#> {xml_document}
#> <foo>
#> [1] <bar>text <baz id="a"/></bar>
#> [2] <bar>2</bar>
#> [3] <baz id="b"/>
(y <- roundtrip(x))
#> {xml_document}
#> <foo>
#> [1] <bar>text <baz id="a"/></bar>
#> [2] <bar>2</bar>
#> [3] <baz id="b"/>

identical(x, y)
#> [1] FALSE
all.equal(x, y)
#> [1] TRUE
xml_children(y)
#> {xml_nodeset (3)}
#> [1] <bar>text <baz id="a"/></bar>
#> [2] <bar>2</bar>
#> [3] <baz id="b"/>
as_list(y)
#> $bar
#> $bar[[1]]
#> [1] "text "
#> 
#> $bar$baz
#> list()
#> attr(,"id")
#> [1] "a"
#> 
#> 
#> $bar
#> $bar[[1]]
#> [1] "2"
#> 
#> 
#> $baz
#> list()
#> attr(,"id")
#> [1] "b"

Also, regarding the second part of your question, I would seriously consider using XPath expressions to extract the desired data, even if you have to rewrite code.
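
For example, on the sample document above, the xml2 XPath helpers xml_find_all(), xml_text() and xml_attr() pull out exactly the nodes you need without building an intermediate list (a small sketch of the idea):

library(xml2)

x <- read_xml("<foo><bar>text <baz id='a'/></bar><bar>2</bar><baz id='b'/></foo>")

# text content of every <bar> node, wherever it sits in the document
xml_text(xml_find_all(x, ".//bar"))
#> [1] "text " "2"

# 'id' attribute of every <baz> node
xml_attr(xml_find_all(x, ".//baz"), "id")
#> [1] "a" "b"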

Jim
  • Thanks! Could you elaborate though why you would recommend using XPATH expressions? Viewing XML-documents in order to understand the structure felt a lot more cumbersome than something like `listviewer::jsonedit`. That's why I initially settled on working with lists instead. – Thomas K May 19 '17 at 21:24
  • You said you had 500k documents to parse. Xpath extracting just the elements you are interested in is going to run much faster than converting the entire data to a list first then manipulating that. – Jim May 21 '17 at 17:51
  • That was the reason for my post. Extracting a single element with XPath is ~17 times faster than with `as_list` in my case. I guess I will re-write, since it is more flexible to work with XPath once you learn how to deal with it. Thanks anyway! – Thomas K May 22 '17 at 16:37