I am trying to scrape voting data from the website of the Russian parliament. I am working with nearly 600 webpages, and I am trying to scrape data from within those pages as well. Here is the code I have written thus far:
# load packages
library(rvest)
library(purrr)
library(tibble)   # tibble()
library(stringr)  # str_trim()
library(writexl)
# one URL per results page
base_url <- sprintf("http://vote.duma.gov.ru/?convocation=AAAAAAA6&sort=date_asc&page=%d", 1:789)
# loop over pages, collecting the title and link of every vote
duma_votes_data <- map_df(base_url, function(i) {
  pg <- read_html(i)
  tibble(
    title = html_elements(pg, ".item-left a") %>% html_text() %>% str_trim(),
    link = html_elements(pg, ".item-left a") %>%
      html_attr("href") %>%
      paste0("http://vote.duma.gov.ru", .)
  )
})
The above code executed successfully and produces a df containing the titles and links. I am now trying to extract the date information. Here is the code I have written for that:
# extract date of vote
duma_votes_data$date <- map(duma_votes_data$link, ~ {
  .x %>%
    read_html() %>%
    html_nodes(".date-p span") %>%
    html_text() %>%
    paste(collapse = " ")
})
After running this code, I receive the following error:
Error in open.connection(x, "rb") : HTTP error 504.
What is the best way to get around this issue? I have read about the possibility of incorporating Sys.sleep() into my code, but I am not sure where it should go. Note that this code is for all 789 pages, as indicated in base_url. The code does work with around 40 pages, so in the worst case I could run everything in small chunks and combine the resulting dfs into a single df.
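For reference, here is a minimal sketch of where I imagine Sys.sleep() could go: once before every request, and again (for longer) after a failed attempt before retrying. The helper names read_with_retry and get_vote_date, and the delay, backoff and max_tries values, are my own placeholders rather than anything from rvest, and I have not confirmed that this avoids the 504 on the real site.

```r
library(rvest)

# Hypothetical helper: call fetch(url), retrying on error with a growing pause.
# fetch defaults to rvest::read_html; max_tries and backoff are guessed values.
read_with_retry <- function(url, fetch = rvest::read_html,
                            max_tries = 5, backoff = 2) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fetch(url), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("Attempt ", attempt, " failed for ", url, ": ",
            conditionMessage(result))
    # wait longer after each failure, but not after the last one
    if (attempt < max_tries) Sys.sleep(backoff * attempt)
  }
  NULL  # all attempts failed; the caller decides what to do
}

# Hypothetical helper: return the date text of one vote page, or NA on failure.
get_vote_date <- function(url, delay = 1, ...) {
  Sys.sleep(delay)  # fixed pause before every request
  pg <- read_with_retry(url, ...)
  if (is.null(pg)) return(NA_character_)
  pg %>%
    html_nodes(".date-p span") %>%
    html_text() %>%
    paste(collapse = " ")
}

# Usage (not run here; makes one request per link):
# duma_votes_data$date <- purrr::map_chr(duma_votes_data$link, get_vote_date)
```

With this shape, links that still fail after all retries come back as NA, so they could be retried later instead of the whole run stopping at the first error.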