rvest scrape multiple values per node

Question

Take this example XML:

<body>
  <items>
    <item>
      <name>Peter</name>
    </item>
  </items>
  <items>
    <item>
      <name>Paul</name>
    </item>
    <item>
      <name>Claudia</name>
    </item>
  </items>
  <items/>
</body>

Question: What is the easiest way to get the following result?

"Peter"   "Paul"   ""

So far I achieve this as follows:

require(rvest)
require(magrittr)
my_xml <- xml("<items><item><name>Peter</name></item></items><items><item><name>Paul</name></item><item><name>Claudia</name></item></items><items></items>")
items <- my_xml %>% xml_nodes("items") %>% xml_node("item")
sapply(items, function(x){
  if(is.null(x)){
    ""
  } else {
    x %>% xml_node("name") %>% xml_text()
  }
})

To me this sapply construction seems like a misuse of either rvest or CSS selectors.

My answer is only slightly different from your approach, but it may be closer to what you're looking for. – hrbrmstr Oct 01 '15 at 11:32

score 2 · Accepted Answer · answered Oct 01 '15 at 11:30

rvest really isn't needed since this is pure XML (and you end up using xml2 constructs anyway):

library(xml2)

doc <- read_xml("<body>
  <items>
    <item>
      <name>Peter</name>
    </item>
  </items>
  <items>
    <item>
      <name>Paul</name>
    </item>
    <item>
      <name>Claudia</name>
    </item>
  </items>
  <items/>
</body>")


# For each <items> node, return the name of its first <item>, or "" if it has none
sapply(xml_find_all(doc, "//items"), function(x) {
  val <- xml_text(xml_find_all(x, "./item[1]/name"))
  if (length(val) > 0) val[1] else ""
})

## [1] "Peter" "Paul"  ""

(sometimes XPath can be better than CSS)
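
For readers on a newer xml2 release, a vectorised variant may be worth trying. This is only a sketch and it assumes an xml2 version that provides `xml_find_first()`, which postdates this answer: called on a node set, it returns one entry per input node and uses a missing node where nothing matches, and `xml_text()` turns a missing node into `NA`. The names `first_names` and `result` are illustrative.

```r
library(xml2)

doc <- read_xml("<body>
  <items><item><name>Peter</name></item></items>
  <items><item><name>Paul</name></item><item><name>Claudia</name></item></items>
  <items/>
</body>")

# One result per <items> node; nodes without a match come back as missing nodes
first_names <- xml_find_first(xml_find_all(doc, "//items"), "./item[1]/name")

# xml_text() yields NA for a missing node, so map NA to ""
result <- xml_text(first_names)
result[is.na(result)] <- ""
result
```

This should give the same three values as the `sapply()` version, without an explicit per-node function.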

hrbrmstr

Thanks for the answer... unfortunately this isn't really a simpler solution. Maybe someone else will come up with a nicer approach. Thanks anyway. – Rentrop Oct 01 '15 at 13:59
It's the best you're going to get since you want to capture the "missing" values. That's a node-by-node operation no matter what, and mine is a tad more efficient since it doesn't have to do a CSS->XPath conversion (`rvest` does that magically for you), has fewer method dispatches (`rvest` routines just call `xml2` ones) and gets at the `node` more efficiently. So it actually is a simpler solution. – hrbrmstr Oct 01 '15 at 14:04
I think xml2 needs a better way to deal with this problem – hadley Oct 04 '15 at 20:05
Another possibility (it's still cumbersome) is to do `xml_find_all(doc, "//items[not(item)]")` which (for those who haven't misspent their youth staring at XPath constructs) will return all of the nodes without the desired sub-nodes (and then do processing that way). Probably either a broader wrapper function would be needed in `xml2` _or_ the ability to specify an R callback function per node (that could then do more intelligent processing and return 'saner' things) since (sadly) there's no free lunch for this in `libxml2`. – hrbrmstr Oct 04 '15 at 20:15
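
The `//items[not(item)]` idea from the last comment could be sketched roughly as follows. This is an illustration, not code from the thread: it assumes every `<item>` has exactly one `<name>`, and it uses `xml_path()` to match the empty nodes back to their position among all `<items>` nodes. The names `items`, `empty`, `is_empty` and `result` are made up for the example.

```r
library(xml2)

doc <- read_xml("<body>
  <items><item><name>Peter</name></item></items>
  <items><item><name>Paul</name></item><item><name>Claudia</name></item></items>
  <items/>
</body>")

items <- xml_find_all(doc, "//items")

# <items> nodes that have no <item> child, as suggested in the comment
empty <- xml_find_all(doc, "//items[not(item)]")

# Identify the empty nodes by their absolute path within the document
is_empty <- xml_path(items) %in% xml_path(empty)

# Start with "" everywhere, then fill in the first name for non-empty nodes
# (assumes each first <item> holds exactly one <name>, so the lengths line up)
result <- character(length(items))
result[!is_empty] <- xml_text(xml_find_all(doc, "//items/item[1]/name"))
result
```

It avoids the per-node callback but needs two passes over the document, so it is arguably no less cumbersome than the `sapply()` version.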
