I'm trying to use splashr to scrape a dynamic webpage, and it's been non-stop problems for me. When I scrape with get_box_score(), I'll get either the error
Error in execute_lua(splash_obj, call_function) :
  Gateway Timeout (HTTP 504).
or
Error in UseMethod("html_table") :
  no applicable method for 'html_table' applied to an object of class "xml_missing"
And honestly, once I "fix" one of the errors, I get the other. I have no idea if these are related, or if I'm just getting a lot of different unrelated errors with my code. Any idea how I can fix these? Here's my code:
library(tidyverse)
library(splashr)
library(rvest)
url <- "https://www.uscho.com/scoreboard/michigan/mens-hockey/"
# Everything should be fine for a while
get_data <- function(myurl) {
  link_data <- myurl %>%
    read_html() %>%
    html_nodes("td:nth-child(13) a") %>%
    html_attr("href") %>%
    str_c("https://www.uscho.com", .) %>%
    as_tibble() %>%
    set_names("url")
  game_type <- myurl %>%
    read_html() %>%
    html_nodes("td:nth-child(12)") %>%
    html_text() %>%
    as_tibble() %>%
    set_names("game_type") %>%
    filter(game_type != "Type")
  as_tibble(data.frame(link_data, game_type))
}
link_list <- get_data(url)
urls <- link_list %>%
  filter(game_type != "EX") %>%
  pull(url)
# Here's where the fun starts
get_box_score <- function(my_url) {
  progress_bar$tick()$print()
  Sys.sleep(15)
  splash_container <- start_splash()
  on.exit(stop_splash(splash_container))
  Sys.sleep(10)
  mydata <- splash_local %>%
    splash_response_body(TRUE) %>%
    splash_user_agent(ua_win10_chrome) %>%
    splash_go(my_url) %>%
    splash_wait(runif(1, 5, 10)) %>%
    splash_html() %>%
    html_node("#boxgoals") %>%
    html_table(fill = TRUE) %>%
    as_tibble()
  return(mydata)
}
progress_bar <- link_list %>%
  filter(game_type != "EX") %>%
  tally() %>%
  progress_estimated(min_time = 0)
mydata <- pmap_df(list(urls), get_box_score)
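For what it's worth, the second error looks reproducible without Splash at all. As far as I can tell, html_node() returns an object of class "xml_missing" when the selector matches nothing, and html_table() has no method for that class. Below is a minimal sketch of a guard, assuming that this is what happens when the rendered page does not contain #boxgoals. safe_table() and the inline HTML are hypothetical, made up for illustration; they are not part of my scraper or of rvest.

```r
library(rvest)

# Hypothetical helper (not part of rvest): return the table matched by
# `css`, or NULL when the selector matches nothing.
safe_table <- function(page, css) {
  node <- html_node(page, css)
  # html_node() yields an object of class "xml_missing" when nothing matches
  if (inherits(node, "xml_missing")) {
    return(NULL)
  }
  html_table(node)
}

# Made-up pages standing in for the HTML that Splash returns
page_with <- read_html(
  '<table id="boxgoals"><tr><th>Period</th><th>Team</th></tr>
   <tr><td>1st</td><td>Michigan</td></tr></table>'
)
page_without <- read_html("<p>Still loading</p>")

found <- safe_table(page_with, "#boxgoals")        # a one-row table
not_found <- safe_table(page_without, "#boxgoals") # NULL instead of an error
```

If that assumption holds, a NULL result inside get_box_score() might be a cue to wait longer or retry that URL rather than let html_table() fail.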