I have a large number of XML files that I need to process. To that end I want to read the files and save the resulting list of objects to disk. I tried to save the list with readr::write_rds, but after reading it back in, the objects are modified and no longer valid. Is there anything I can do to alleviate this problem?

library(readr)
library(xml2)

x <- read_xml("<foo>
              <bar>text <baz id = 'a' /></bar>
              <bar>2</bar>
              <baz id = 'b' />
              </foo>")

# function to save and read object
roundtrip <- function(obj) {
  tf <- tempfile()
  on.exit(unlink(tf))

  write_rds(obj, tf)
  read_rds(tf)
}

list(x)
#> [[1]]
#> {xml_document}
#> <foo>
#> [1] <bar>text <baz id="a"/></bar>
#> [2] <bar>2</bar>
#> [3] <baz id="b"/>
roundtrip(list(x))
#> [[1]]
#> {xml_document}

identical(x, roundtrip(x))
#> [1] FALSE
all.equal(x, roundtrip(x))
#> [1] TRUE
xml_children(roundtrip(x))
#> Error in fun(x$node, ...): external pointer is not valid
as_list(roundtrip(x))
#> Error in fun(x$node, ...): external pointer is not valid

Some context

I have around 500,000 XML files. To process them I planned to turn each one into a list with xml2::as_list, and I wrote code to extract what I need from those lists. Afterwards I realized that as_list is very expensive to run. I could either:

  1. re-write already carefully debugged code to parse data directly (xml_child, xml_text, ...), or
  2. use as_list.

To speed up option 2 I could run it on another machine with more cores, but I would like to pass a single file to that machine, because collecting and copying all the individual files is time-consuming.

Thomas K

1 Answer

xml2 objects have external pointers that become invalid when you serialize them naively. The package provides the functions xml_serialize() and xml_unserialize() to handle this for you. Unfortunately the API is slightly cumbersome because base::serialize() and base::unserialize() assume an open connection.


library(xml2)

x <- read_xml("<foo>
              <bar>text <baz id = 'a' /></bar>
              <bar>2</bar>
              <baz id = 'b' />
              </foo>")

# function to save and read object
roundtrip <- function(obj) {
  tf <- tempfile()
  con <- file(tf, "wb")
  on.exit(unlink(tf))

  xml_serialize(obj, con)
  close(con)
  con <- file(tf, "rb")
  on.exit(close(con), add = TRUE)
  xml_unserialize(con)
}
x
#> {xml_document}
#> <foo>
#> [1] <bar>text <baz id="a"/></bar>
#> [2] <bar>2</bar>
#> [3] <baz id="b"/>
(y <- roundtrip(x))
#> {xml_document}
#> <foo>
#> [1] <bar>text <baz id="a"/></bar>
#> [2] <bar>2</bar>
#> [3] <baz id="b"/>

identical(x, y)
#> [1] FALSE
all.equal(x, y)
#> [1] TRUE
xml_children(y)
#> {xml_nodeset (3)}
#> [1] <bar>text <baz id="a"/></bar>
#> [2] <bar>2</bar>
#> [3] <baz id="b"/>
as_list(y)
#> $bar
#> $bar[[1]]
#> [1] "text "
#> 
#> $bar$baz
#> list()
#> attr(,"id")
#> [1] "a"
#> 
#> 
#> $bar
#> $bar[[1]]
#> [1] "2"
#> 
#> 
#> $baz
#> list()
#> attr(,"id")
#> [1] "b"

Also, regarding the second part of your question, I would seriously consider using XPath expressions to extract the desired data, even if you have to rewrite code.
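
For example, on the sample document above, the xml2 XPath helpers xml_find_all(), xml_text() and xml_attr() pull out exactly the nodes you need without building an intermediate list (a small sketch of the idea):

library(xml2)

x <- read_xml("<foo><bar>text <baz id='a'/></bar><bar>2</bar><baz id='b'/></foo>")

# text content of every <bar> node, wherever it sits in the document
xml_text(xml_find_all(x, ".//bar"))
#> [1] "text " "2"

# 'id' attribute of every <baz> node
xml_attr(xml_find_all(x, ".//baz"), "id")
#> [1] "a" "b"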

Jim
  • Thanks! Could you elaborate though why you would recommend using XPATH expressions? Viewing XML-documents in order to understand the structure felt a lot more cumbersome than something like `listviewer::jsonedit`. That's why I initially settled on working with lists instead. – Thomas K May 19 '17 at 21:24
  • You said you had 500k documents to parse. Xpath extracting just the elements you are interested in is going to run much faster than converting the entire data to a list first then manipulating that. – Jim May 21 '17 at 17:51
  • That was the reason for my post. Extracting a single element with XPath is ~17 times faster than with `as_list` in my case. I guess I will re-write, since it is more flexible to work with XPath once you learn how to deal with it. Thanks anyway! – Thomas K May 22 '17 at 16:37