I am trying to scrape voting data from the website of the Russian parliament. The vote listing spans 789 pages, and I also need data from the individual vote pages linked from each listing page. Here is the code I have written so far:

# load packages
library(rvest)
library(purrr)
library(tibble)   # tibble()
library(stringr)  # str_trim()
library(writexl)

# URLs of the listing pages (one per page number)
base_url <- sprintf("http://vote.duma.gov.ru/?convocation=AAAAAAA6&sort=date_asc&page=%d", 1:789)

# loop over the listing pages and collect the title and link of every vote
duma_votes_data <- map_df(base_url, function(i) {
  pg <- read_html(i)
  items <- html_elements(pg, ".item-left a")
  tibble(
    title = items %>% html_text() %>% str_trim(),
    # hrefs are relative, so prepend the host
    link = items %>%
      html_attr("href") %>%
      paste0("http://vote.duma.gov.ru", .)
  )
})

The above code runs successfully and produces a data frame containing the titles and links. I am now trying to extract the date of each vote from the linked pages. Here is the code I have written for that:

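# extract date of vote (one request per link)
# map_chr() returns a character vector; map() would create a list column
duma_votes_data$date <- map_chr(duma_votes_data$link, ~ {
  .x %>%
    read_html() %>%
    html_elements(".date-p span") %>%
    html_text() %>%
    paste(collapse = " ")
})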

After running this code, I receive the following error:

Error in open.connection(x, "rb") : HTTP error 504.

What is the best way to get around this issue? I have read about incorporating Sys.sleep() into my code, but I am not sure where it should go. Note that this code runs over the links from all 789 pages, as indicated in base_url. The code does work for the links from around 40 pages, so in the worst case I could process everything in small chunks and combine the resulting data frames into a single one.
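For reference, here is a rough sketch of where I imagine the pause and the error handling could go. It assumes the 504 is a transient server-side timeout, which I have not confirmed. The helper names `fetch_with_retry()` and `extract_date()`, and the `max_tries` and `pause` defaults, are made up for illustration; only `read_html()`, `html_nodes()`, `html_text()`, `tryCatch()` and `Sys.sleep()` are existing functions.

```r
library(rvest)

# Hypothetical helper: try to fetch a page up to `max_tries` times,
# sleeping a little longer after each failed attempt.
# `fetch` defaults to read_html() but can be swapped out for testing.
fetch_with_retry <- function(url, fetch = read_html, max_tries = 3, pause = 2) {
  for (attempt in seq_len(max_tries)) {
    # Turn any error (e.g. "HTTP error 504") into NULL instead of aborting
    result <- tryCatch(fetch(url), error = function(e) NULL)
    if (!is.null(result)) {
      return(result)
    }
    # Back off before the next attempt; no need to wait after the last one
    if (attempt < max_tries) {
      Sys.sleep(pause * attempt)
    }
  }
  NULL  # every attempt failed
}

# Hypothetical helper: return the date text for one vote page,
# or NA if the page could not be fetched at all.
extract_date <- function(url, ...) {
  pg <- fetch_with_retry(url, ...)
  if (is.null(pg)) {
    return(NA_character_)
  }
  pg %>%
    html_nodes(".date-p span") %>%
    html_text() %>%
    paste(collapse = " ")
}

# Intended use (commented out because it hits the network):
# duma_votes_data$date <- purrr::map_chr(duma_votes_data$link, function(u) {
#   Sys.sleep(1)  # assumed polite delay between requests
#   extract_date(u)
# })
```

With this layout, a link that keeps failing ends up as `NA` in the date column instead of stopping the whole run, so the failed links could be retried separately afterwards.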

  • You will always have to expect errors - no matter what you do to avoid them. – Felix Dec 02 '21 at 18:48
  • Here `Error 504` may be because the `link` or `node` doesn't exist. I don't think `Sys.sleep()` is helpful here. To set timeout refer [here](https://stackoverflow.com/questions/48722076/how-to-set-timeout-in-rvest). Or else you can use `tryCatch` to skip the error and move on. Finally, you can try `RSelenium`. – Nad Pat Dec 02 '21 at 19:26
