I want to scrape a web page that uses tabs. I didn't have much luck with rvest, so I am trying splashr.

Splash is a headless browser designed specifically for web scraping. As mentioned in this introduction, you need access to a running Splash environment. I recommend using a Docker container, as described there. I had never used Docker before, but in this case it was straightforward to set up.

I am looking for the rates for non-cashable GICs, which requires clicking a tab at the top of the page.

With this:

library(HARtools)
library(splashr)
library(tidyverse)

td_raw <- possibly(render_har, "bad")(url = "https://www.td.com/ca/en/personal-banking/products/saving-investing/gic-rates-canada/", wait = 5, response_body = TRUE)

har_entries(td_raw) %>% 
  purrr::map_chr(get_content_type) %>% 
  table(dnn = "content_type") %>% 
  broom::tidy() %>% 
  dplyr::arrange(desc(n))

I get a list of the types of content on the page:

# A tibble: 11 x 2
   content_type                 n
   <chr>                    <int>
 1 image/gif                  118
 2 application/javascript      69
 3 text/javascript             48
 4 text/html                   26
 5 image/png                   10
 6 application/font-woff2       6
 7 application/json             5
 8 application/x-javascript     4
 9 text/css                     4
10 image/svg+xml                3
11 text/plain                   1

Depending on how you count, there are 5 tables on the page, so looking at what is in the application/json entries seems like a good place to start.

This gives me an error:

har_entries(td_raw) %>% 
  purrr::keep(is_json) %>% 
  purrr::map(get_response_body, "text") %>% 
  purrr::map(jsonlite::fromJSON)

Error: lexical error: invalid char in json text.
                                       EGAINCLOUD._callback.eg3c465e5d
                     (right here) ------^
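
The text flagged in the error starts with a callback-style name rather than `{` or `[`, which suggests that at least one body is JSONP (JSON wrapped in a function call) rather than plain JSON. The following is a minimal sketch of unwrapping such a body before parsing. It assumes the body has the shape `name(<json>);`. The helper `strip_jsonp`, the callback name and the sample payload are all made up for illustration and are not taken from the page.

```r
library(jsonlite)

# Hypothetical helper: if the text looks like `callbackName(<json>);`,
# return only the part inside the outermost parentheses.
# Text that does not match the pattern is returned unchanged.
strip_jsonp <- function(txt) {
  sub("(?s)^\\s*[\\w$.]+\\s*\\((.*)\\)\\s*;?\\s*$", "\\1", txt, perl = TRUE)
}

# Made-up JSONP body, used only to show the idea
sample_body <- 'EGAINCLOUD._callback.demo({"rates":[{"term":"1 year","rate":0.5}]});'

# Unwrap, then parse; `parsed$rates` is a data frame
parsed <- jsonlite::fromJSON(strip_jsonp(sample_body))
parsed$rates
```

In the pipeline above, this step could sit between `get_response_body` and `fromJSON`, for example `purrr::map(strip_jsonp) %>% purrr::map(purrr::possibly(jsonlite::fromJSON, NULL))`, so that any body that still fails to parse yields `NULL` instead of stopping the whole chain. Whether any of these bodies actually contains the rate tables is something the sketch does not establish.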

How do I get to having the rate information in a 'friendlier' dataframe format?

ixodid
  • I get an error on your initial code with _Error in names(df) <- repaired_names(c(names2(dimnames(x)), n), .name_repair = .name_repair, : 'names' attribute [2] must be the same length as the vector [1]_ – QHarr Nov 08 '20 at 07:17
  • Not sure what to say. No errors running the first block of code above. There is a warning: 'tidy.table' is deprecated. See help("Deprecated") – ixodid Nov 08 '20 at 07:20
  • Ah... it is because it is returning "bad" – QHarr Nov 08 '20 at 07:34
  • Does the url work for you? – ixodid Nov 08 '20 at 07:38
  • the url works in the browser but not when I have used as in your code example. I haven't used splashR before but assume there is a status code return somewhere? – QHarr Nov 08 '20 at 08:32
  • Ah.... I see there is additional set-up required to have a running instance of splash by the looks of it – QHarr Nov 08 '20 at 08:38
  • Why are you not trying {RSelenium}? – Indranil Gayen Nov 17 '20 at 06:57

0 Answers