Take this example XML:

<body>
  <items>
    <item>
      <name>Peter</name>
    </item>
  </items>
  <items>
    <item>
      <name>Paul</name>
    </item>
    <item>
      <name>Claudia</name>
    </item>
  </items>
  <items/>
</body> 

Question: What is the easiest way to get the name of the first <item> in each <items> node, with an empty string where there is none, i.e. the following result?

"Peter"   "Paul"   ""

So far I achieve this as follows:

require(rvest)
require(magrittr)
my_xml <- xml("<items><item><name>Peter</name></item></items><items><item><name>Paul</name></item><item><name>Claudia</name></item></items><items></items>")
items <- my_xml %>% xml_nodes("items") %>% xml_node("item")
sapply(items, function(x){
  if(is.null(x)){
    ""
  } else {
    x %>% xml_node("name") %>% xml_text()
  }
})

To me this sapply construction seems like a misuse of either rvest or CSS selectors.

hrbrmstr
Rentrop
  • my answer is only slightly different than your approach, but it may be more of what you're looking for. – hrbrmstr Oct 01 '15 at 11:32

1 Answer

rvest really isn't needed since this is pure XML (and you end up using xml2 constructs anyway):

library(xml2)

doc <- read_xml("<body>
  <items>
    <item>
      <name>Peter</name>
    </item>
  </items>
  <items>
    <item>
      <name>Paul</name>
    </item>
    <item>
      <name>Claudia</name>
    </item>
  </items>
  <items/>
</body>")


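# For each <items> node, take the text of the first <item>'s <name>,
# or "" when the node has no such child.
sapply(xml_find_all(doc, "//items"), function(x) {
  val <- xml_text(xml_find_all(x, "./item[1]/name"))
  if (length(val) > 0) val[[1]] else ""
})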

## [1] "Peter" "Paul"  ""     

(sometimes XPath can be better than CSS)
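
For readers on a more recent xml2 than the one this answer was written against, here is a minimal sketch of the same lookup without the per-node function. It assumes an xml2 release that provides xml_find_first(), which returns one result per input node and a "missing" node where nothing matches, so that xml_text() yields NA for those. The variable names are only illustrative.

```r
library(xml2)

doc <- read_xml("<body>
  <items><item><name>Peter</name></item></items>
  <items><item><name>Paul</name></item><item><name>Claudia</name></item></items>
  <items/>
</body>")

items <- xml_find_all(doc, "//items")

# One result per <items> node; nodes without a match give NA.
first_names <- xml_text(xml_find_first(items, "./item[1]/name"))

# Turn the NA placeholders into the empty strings the question asks for.
first_names[is.na(first_names)] <- ""

first_names
```

The output keeps the same length and order as the items node set, which is what makes the empty third node visible in the result.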

hrbrmstr
  • Thanks for the answer... unfortunately this isn't really a simpler solution. Maybe someone else will come up with a nicer approach. Thanks anyway – Rentrop Oct 01 '15 at 13:59
  • It's the best you're going to get since you want to capture the "missing" values. That's a node-by-node operation no matter what, and mine is a tad more efficient since it doesn't have to do a CSS->XPath conversion (`rvest` does that magically for you), has fewer method dispatches (`rvest` routines just call `xml2` ones) and gets at the `node` more efficiently. So it actually is a simpler solution. – hrbrmstr Oct 01 '15 at 14:04
  • I think xml2 needs a better way to deal with this problem – hadley Oct 04 '15 at 20:05
  • Another possibility (it's still cumbersome) is to do `xml_find_all(doc, "//items[not(item)]")` which (for those who haven't misspent their youth staring at XPath constructs) will return all of the nodes without the desired sub-nodes (and then do processing that way). Probably either a broader wrapper function would be needed in `xml2` _or_ the ability to specify an R callback function per node (that could then do more intelligent processing and return 'saner' things) since (sadly) there's no free lunch for this in `libxml2`. – hrbrmstr Oct 04 '15 at 20:15
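
The `//items[not(item)]` idea from the last comment could be sketched roughly as follows. This is one possible reading of "do processing that way", not the commenter's own code: it identifies the empty nodes by comparing `xml_path()` values, and it assumes every first `<item>` contains exactly one `<name>`, so that the found names line up with the non-empty nodes.

```r
library(xml2)

doc <- read_xml("<body>
  <items><item><name>Peter</name></item></items>
  <items><item><name>Paul</name></item><item><name>Claudia</name></item></items>
  <items/>
</body>")

items <- xml_find_all(doc, "//items")
empty <- xml_find_all(doc, "//items[not(item)]")

# Mark which <items> nodes have no <item> child by matching their paths.
is_empty <- xml_path(items) %in% xml_path(empty)

# Start with "" everywhere, then fill in the names for the non-empty nodes.
# Assumption: each first <item> has exactly one <name>.
result <- character(length(items))
result[!is_empty] <- xml_text(xml_find_all(doc, "//items/item[1]/name"))

result
```

This avoids the per-node function but needs three passes over the document, so it is still cumbersome, as the comment says.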