I wish to scrape a web page that uses tabs. I didn't have much luck with rvest, so I am trying splashr.
Splash is a headless browser designed specifically for web scraping. As mentioned in the splashr introduction, you need access to a running Splash instance, and the recommended route is a Docker container. I had never used Docker before, but in this case it was straightforward to set up.
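For completeness, this is roughly the setup I used (a sketch; it assumes Docker is already installed and relies on splashr's helper functions to pull and run the Splash image):
library(splashr)

# Pull the Splash Docker image (one-time setup); assumes Docker is installed
# and running on this machine.
install_splash()

# Start a Splash container and confirm it is reachable (localhost:8050 by default).
splash_container <- start_splash()
splash_active()

# ... scraping goes here ...

# Stop the container when finished.
stop_splash(splash_container)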
I am seeking rates for non-cashable GICs, which requires clicking a tab at the top of the page.
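I gather splashr's DSL can simulate that click, roughly as in the sketch below (the splash_click coordinates are placeholders, not the real position of the tab), but before tackling that I wanted to see what the page actually loads.
library(splashr)

# Rough sketch: load the page, give the scripts time to run, click where the
# non-cashable tab should be, wait again, and return the rendered HTML.
# The (x, y) coordinates are placeholders I have not worked out yet.
pg <- splash_local %>%
  splash_go("https://www.td.com/ca/en/personal-banking/products/saving-investing/gic-rates-canada/") %>%
  splash_wait(5) %>%
  splash_click(x = 100, y = 200) %>%
  splash_wait(2) %>%
  splash_html()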
With this:
library(HARtools)
library(splashr)
library(tidyverse)

# Render the page through Splash and capture the HAR (full request/response log),
# keeping the response bodies so they can be inspected afterwards.
td_raw <- possibly(render_har, "bad")(
  url = "https://www.td.com/ca/en/personal-banking/products/saving-investing/gic-rates-canada/",
  wait = 5,
  response_body = TRUE
)

# Count how many responses of each content type the page loaded.
har_entries(td_raw) %>%
  purrr::map_chr(get_content_type) %>%
  table(dnn = "content_type") %>%
  broom::tidy() %>%
  dplyr::arrange(desc(n))
I get a summary of the content types on the page:
# A tibble: 11 x 2
content_type n
<chr> <int>
1 image/gif 118
2 application/javascript 69
3 text/javascript 48
4 text/html 26
5 image/png 10
6 application/font-woff2 6
7 application/json 5
8 application/x-javascript 4
9 text/css 4
10 image/svg+xml 3
11 text/plain 1
Depending on how you count, there are five tables on the page, so having a look at what's in the application/json responses is a start.
This gives me an error:
har_entries(td_raw) %>%
  purrr::keep(is_json) %>%                  # keep only the JSON responses
  purrr::map(get_response_body, "text") %>% # pull out each body as text
  purrr::map(jsonlite::fromJSON)            # parse the JSON
Error: lexical error: invalid char in json text.
EGAINCLOUD._callback.eg3c465e5d
(right here) ------^
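I think the error means these responses are not plain JSON but JSONP, i.e. the payload is wrapped in a JavaScript callback (the EGAINCLOUD._callback.eg3c465e5d in the message). My guess is that the wrapper needs to be stripped before parsing, something like the sketch below (the regex is only a guess at the wrapper format), but I am not sure this is the right approach:
# Guess: remove the JSONP wrapper "callbackName( ... )" and keep only the JSON
# inside the outermost parentheses before handing it to fromJSON().
strip_jsonp <- function(txt) {
  sub("^[^(]*\\((.*)\\)\\s*;?\\s*$", "\\1", txt, perl = TRUE)
}

har_entries(td_raw) %>%
  purrr::keep(is_json) %>%
  purrr::map(get_response_body, "text") %>%
  purrr::map(strip_jsonp) %>%
  purrr::map(jsonlite::fromJSON)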
How do I get the rate information into a 'friendlier' data frame format?