I'm relatively new to R (and brand spanking new to scraping with R), so apologies in advance if I'm overlooking something obvious here!

I've been trying to learn how to scrape with RSelenium by following this tutorial: https://rawgit.com/petrkeil/Blog/master/2017_08_15_Web_scraping/web_scraping.html#advanced-scraping-with-rselenium

After running `docker run -d -p 4445:4444 selenium/standalone-firefox` in Terminal, I tried to run the R code below, pulled with only slight modifications from the tutorial linked above:

library(RSelenium)

get.tree <- function(genus, species)
{
  # navigate to the page
  browser <- remoteDriver(port=4445L)
  browser$open(silent = TRUE)

  browser$navigate("http://www.bgci.org/global_tree_search.php?sec=globaltreesearch")
  browser$refresh()

  # create r objects from the web search input and button elements

  genusElem <- browser$findElement(using = 'id', value = "genus-field")
  specElem <- browser$findElement(using = 'id', value = "species-field")
  buttonElem <- browser$findElement(using = 'class', value = "btn_ohoDO")

  # tell R to fill in the fields

  genusElem$sendKeysToElement(list(genus))
  specElem$sendKeysToElement(list(species))

  # tell R to click the search button

  buttonElem$clickElement()

  # get output

  out <- browser$findElement(using = "css", value = "td.cell_1O3UaG:nth-child(4)") # the country origin
  out <- out$getElementText()[[1]] # extract actual text string
  out <- strsplit(out, split = "; ")[[1]] # turns into character vector

  # close browser

  browser$close()

  return(out)
}

# Now let's try it:

get.tree("Abies", "alba")

But after doing all that, I get the following error:

Selenium message:Failed to decode response from marionette Build info: version: '3.6.0', revision: '6fbf3ec767', time: '2017-09-27T16:15:40.131Z' System info: host: 'd260fa60d69b', ip: '172.17.0.2', os.name: 'Linux', os.arch: 'amd64', os.version: '4.9.49-moby', java.version: '1.8.0_131' Driver info: driver.version: unknown

Error: Summary: UnknownError Detail: An unknown server-side error occurred while processing the command. class: org.openqa.selenium.WebDriverException Further Details: run errorDetails method

Anyone have any idea what this means and where I went wrong?

Thanks very much for your help!

jubjub
  • Use Google Chrome or an older version of Firefox, as in the tutorial (`sudo docker run -d -p 4445:4444 selenium/standalone-firefox:2.53.0`). The issue with newer versions of Firefox is that they are gradually switching to the W3C protocol. – jdharrison Nov 04 '17 at 10:30
  • @jdharrison I faced the same issue (I had been using version `3.11.0`), using version `2.53.0` works so far (let's hope that problem won't show up again). You might consider posting your comment as an answer (as it seems to solve the problem). – niko Apr 10 '18 at 14:35

1 Answer

Just take advantage of the XHR request the page makes to retrieve the in-line results and toss RSelenium:

library(httr)
library(tidyverse)

get_tree <-  function(genus, species) {

  GET(
    url = sprintf("https://data.bgci.org/treesearch/genus/%s/species/%s", genus, species), 
    add_headers(
      Origin = "http://www.bgci.org", 
      Referer = "http://www.bgci.org/global_tree_search.php?sec=globaltreesearch"
    )
  ) -> res

  stop_for_status(res)

  matches <- content(res, flatten=TRUE)$results[[1]]

  flatten_df(matches[c("id", "taxon", "family", "author", "source", "problems", "distributionDone", "note", "wcsp")]) %>% 
    mutate(geo = list(map_chr(matches$TSGeolinks, "country"))) %>% 
    mutate(taxas = list(map_chr(matches$TSTaxas, "checkTaxon")))

}

xdf <- get_tree("Abies", "alba")

xdf
## # A tibble: 1 x 8
##      id      taxon   family author     source distributionDone        geo      taxas
##   <int>      <chr>    <chr>  <chr>      <chr>            <chr>     <list>     <list>
## 1 58373 Abies alba Pinaceae  Mill. WCSP Phans              yes <chr [21]> <chr [45]>

glimpse(xdf)
## Observations: 1
## Variables: 8
## $ id               <int> 58373
## $ taxon            <chr> "Abies alba"
## $ family           <chr> "Pinaceae"
## $ author           <chr> "Mill."
## $ source           <chr> "WCSP Phans"
## $ distributionDone <chr> "yes"
## $ geo              <list> [<"Albania", "Andorra", "Austria", "Bulgaria", "Croatia", "Czech Republic", "Fr...
## $ taxas            <list> [<"Abies abies", "Abies alba f. columnaris", "Abies alba f. compacta", "Abies a...

It's highly likely you'll need to modify get_tree() at some point, but it's better than having Selenium or Splash or PhantomJS or Headless Chrome as a dependency.
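
Since the original question only wanted the countries of origin as a character vector, here is a minimal sketch of pulling just that piece out of one parsed result. It assumes the record has the same shape that get_tree() relies on above (a `TSGeolinks` list whose entries each have a `country` field). `extract_countries()` and `mock_match` are hypothetical names introduced for illustration, and the mock record is made up so the snippet runs without hitting the API:

```r
# Hypothetical helper: pull the country names out of one parsed result.
# `match` is assumed to look like content(res)$results[[1]] in get_tree(),
# i.e. a list with a TSGeolinks element whose entries carry a "country" field.
extract_countries <- function(match) {
  links <- match$TSGeolinks
  # No distribution data: return an empty character vector instead of failing
  if (is.null(links) || length(links) == 0) {
    return(character(0))
  }
  # One country string per geolink entry
  vapply(links, function(x) as.character(x[["country"]]), character(1))
}

# Mock record standing in for a real response (illustrative, not API output)
mock_match <- list(
  taxon = "Abies alba",
  TSGeolinks = list(
    list(country = "Albania"),
    list(country = "Andorra")
  )
)

countries <- extract_countries(mock_match)
print(countries)
```

With a real response you would presumably pass `content(res)$results[[1]]` instead of the mock; an entry without a `country` field would make `vapply()` stop with an error, which may or may not be what you want.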

hrbrmstr
  • As a package developer yourself, you would probably agree that it is discourteous to fellow developers to cast slurs. If you provide a helpful answer, great, but does it need the emotive extras? – jdharrison Nov 04 '17 at 10:29
  • I meant Selenium @jdharrison not your package. It was a typo. – hrbrmstr Nov 04 '17 at 10:54
  • And said typo is fixed and three other rather significant alternative external technology dependencies have been added. Perhaps assume mistake vs malice in the future? – hrbrmstr Nov 04 '17 at 10:55