I am trying to scrape a webpage that uses AngularJS. My understanding is that the only option in R is to load the page with RSelenium first and then parse the content. However, I find rvest more intuitive than RSelenium for parsing, so I would like to do as little as possible in RSelenium and switch to rvest as soon as I can.

So far, I have realized that I probably need RSelenium at least to connect and to download the HTML code using htmlTreeParse. Suppose this is part of my output:

structure(list(name = "div", attributes = structure(c("im_dialog_date", 
"dialogMessage.dateText"), .Names = c("class", "ng-bind")), children = structure(list(
    text = structure(list(name = "text", attributes = NULL, children = NULL, 
        namespace = NULL, namespaceDefinitions = NULL, value = "6:52 PM"), .Names = c("name", 
    "attributes", "children", "namespace", "namespaceDefinitions", 
    "value"), class = c("XMLTextNode", "XMLNode", "RXMLAbstractNode", 
    "XMLAbstractNode", "oldClass"))), .Names = "text"), namespace = NULL, 
    namespaceDefinitions = NULL), .Names = c("name", "attributes", 
"children", "namespace", "namespaceDefinitions"), class = c("XMLNode", 
"RXMLAbstractNode", "XMLAbstractNode", "oldClass"))

How can I pass it to rvest::read_html()?

  • I would suspect you would need to bypass `read_html`, not feed it. The purpose of `read_html` is to download data so that follow-on functions (e.g., `html_nodes`) can do something with it. Unfortunately, brief inspection of the output from `read_html` suggests it isn't trivial, as it contains no real data, just pointers. This could be a number of things, but will be much harder to reverse-engineer. Perhaps you should be looking into using `xml2` directly instead of through `rvest`? – r2evans Sep 03 '17 at 01:01

1 Answer

If you look at the class of your item, it's an XMLNode, which is a class defined by the XML package. That package defines a toString method for the class (but, curiously, not as.character), which lets you convert the node to an ordinary string that can in turn be read in by xml2::read_html:

library(rvest)
#> Loading required package: xml2

node <- structure(list(name = "div", attributes = structure(c("im_dialog_date", 
"dialogMessage.dateText"), .Names = c("class", "ng-bind")), children = structure(list(
    text = structure(list(name = "text", attributes = NULL, children = NULL, 
        namespace = NULL, namespaceDefinitions = NULL, value = "6:52 PM"), .Names = c("name", 
    "attributes", "children", "namespace", "namespaceDefinitions", 
    "value"), class = c("XMLTextNode", "XMLNode", "RXMLAbstractNode", 
    "XMLAbstractNode", "oldClass"))), .Names = "text"), namespace = NULL, 
    namespaceDefinitions = NULL), .Names = c("name", "attributes", 
"children", "namespace", "namespaceDefinitions"), class = c("XMLNode", 
"RXMLAbstractNode", "XMLAbstractNode", "oldClass"))

node %>% XML::toString.XMLNode() %>% read_html()
#> {xml_document}
#> <html>
#> [1] <body><div class="im_dialog_date" ng-bind="dialogMessage.dateText">6 ...

That said, I normally use the RSelenium::remoteDriver's getPageSource() method to grab all of the HTML at once, which is then easily parsed with rvest.
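As a rough sketch of that approach: getPageSource() returns a list, so the HTML string has to be pulled out with [[1]] before it goes to read_html(). The helper name page_to_html and the stand-in fake_driver below are made up for illustration so the snippet runs without a Selenium server; with a live session you would pass your own remoteDriver object instead.

```r
library(rvest)

# Hypothetical helper (not part of RSelenium or rvest): takes any object that
# exposes getPageSource() and returns the page parsed by read_html().
page_to_html <- function(driver) {
  # getPageSource() returns a list; the HTML string is its first element
  src <- driver$getPageSource()[[1]]
  read_html(src)
}

# Stand-in for a live remoteDriver, an assumption made only so that this
# sketch is self-contained. It mimics the list returned by getPageSource().
fake_driver <- list(
  getPageSource = function() {
    list('<html><body><div class="im_dialog_date" ng-bind="dialogMessage.dateText">6:52 PM</div></body></html>')
  }
)

page <- page_to_html(fake_driver)

# From here on it is plain rvest
date_text <- page %>%
  html_nodes("div.im_dialog_date") %>%
  html_text()
```

With a real session, the call would be page_to_html(remDr) once the driver has navigated to the page and the JavaScript has run.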

  • Yep! Just get your remote driver to the page you want (with JavaScript run, logged in, forms submitted, buttons clicked, whatever) and then grab the page source directly instead of trying to select the node within RSelenium. – alistaire Sep 03 '17 at 01:13
  • I was getting an error; I just realized the reason is that I needed to pass the content of the list, thus `read_html(remDr$getPageSource()[[1]])` – Dambo Sep 03 '17 at 01:20
  • Another option is @hrbrmstr's new [splashr](https://github.com/hrbrmstr/splashr) package, which pipes nicely and whose `render_html` and `splash_html` functions return HTML already read in by `xml2::read_html`. – alistaire Sep 03 '17 at 01:20
  • Ah, yeah, forgot what the output of that looked like, sorry. – alistaire Sep 03 '17 at 01:20