trouble scraping html table data in an interval with rvest

Question

Two weeks ago I asked how to scrape html tables with nested columns. With all your help, I can scrape data for one particular day and filter out irrelevant row information:

library(rvest)
library(dplyr)
library(tidyverse)

theDate <- Sys.Date() - 7
theDateInNumber <- gsub("\\-", "", Sys.Date() - 7)

url_data <- paste0("https://www.immd.gov.hk/eng/stat_", theDateInNumber, ".html")

rows <- read_html(url_data) %>% html_elements(".table-passengerTrafficStat tbody tr")
prefixes <- c("arr", "dep")
cols <- c("Hong Kong Residents", "Mainland Visitors", "Other Visitors", "Total")
headers <- c("Control_Point", crossing(prefixes, cols) %>% unite("headers", 1:2, remove = T) %>% unlist() %>% unname())

df <- map_dfr(
  rows,
  function(x) {
    x %>%
      html_elements("td[headers]") %>%
      set_names(headers) %>%
      html_text()
  }
) %>%
  filter(Control_Point %in% c("Airport")) %>% #select only airport data
  mutate(across(c(-1), ~ str_replace(.x, ",", "") %>% as.integer())) %>%
  mutate(date = theDate)

write.csv(df, "immigrationStatistics.csv")

view(df)

This time I am trying to scrape the same type of data, airport travel figures, over an arbitrary date range. My goal is to build a table of airport traffic and plot a line chart of the population change over that interval, but I am having trouble with the iteration.

My code is as follows:

library(rvest)
library(dplyr)
library(tidyverse)


start <- as.Date("01-09-22", format = "%d-%m-%y")
end   <- as.Date("30-09-22", format = "%d-%m-%y")


prefixes <- c("arr", "dep")
cols <-
  c("Hong Kong Residents",
    "Mainland Visitors",
    "Other Visitors",
    "Total")
headers <-
  c("Control_Point", crossing(prefixes, cols) %>% unite("headers", 1:2, remove = T) %>% unlist() %>% unname())


theDate <- start
while (theDate <= end)
{
  url_data <-
    print(paste0("https://www.immd.gov.hk/eng/stat_", format(theDate, "%Y%m%d"), ".html"
    ))
  
  rows <-
    read_html(url_data) %>% html_elements(".table-passengerTrafficStat tbody tr")
 
  df <- map_dfr(rows,
                function(x) {
                  x %>%
                    html_elements("td[headers]") %>%
                    set_names(headers) %>%
                    html_text()
                }) %>%
    filter(Control_Point %in% c("Airport")) %>% #select only airport data
    mutate(across(c(-1), ~ str_replace(.x, ",", "") %>% as.integer())) %>%
    mutate(date = theDate - 1) %>%
    write.csv(df, "immigrationStatistics.csv")
  
  theDate <- theDate + 1
}
view(df)

May I know why and where the error occurs, and how to fix the iteration? The console reports:

[1] "https://www.immd.gov.hk/eng/stat_20220901.html"
Error in file == "" : 
  comparison (1) is possible only for atomic and list types
> view(df)
Error in checkHT(n, dim(x)) : 
  invalid 'n' -  must contain at least one non-missing element, got none.

Thanks a million in advance.

score 1 · Accepted Answer · answered Oct 01 '22 at 14:54

I was unable to reproduce your error. However, I changed the code to collect the result of each iteration in a list and then write everything to the file just once, after the loop. Your original code appears to overwrite the data file on every iteration.

library(rvest)
library(dplyr)
library(purrr)
library(stringr)
library(tidyr)   # crossing() and unite() come from tidyr

start <- as.Date("01-09-22", format = "%d-%m-%y")
end   <- as.Date("3-09-22", format = "%d-%m-%y")

prefixes <- c("arr", "dep")
cols <-
   c("Hong Kong Residents",
     "Mainland Visitors",
     "Other Visitors",
     "Total")
headers <-
   c("Control_Point", crossing(prefixes, cols) %>% unite("headers", 1:2, remove = T) %>% unlist() %>% unname())

answer <- list()
theDate <- start
while (theDate <= end) {
   url_data <-
      print(paste0("https://www.immd.gov.hk/eng/stat_", format(theDate, "%Y%m%d"), ".html"
      ))
   
   rows <-
      read_html(url_data) %>% html_elements(".table-passengerTrafficStat tbody tr")
   
   df <- map_dfr(rows,
                 function(x) {
                    x %>%
                       html_elements("td[headers]") %>%
                       set_names(headers) %>%
                       html_text()
                 })  %>%
      filter(Control_Point %in% c("Airport")) %>% #select only airport data
      mutate(across(c(-1), ~ str_replace(.x, ",", "") %>% as.integer())) %>%
      mutate(date = theDate - 1)         
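   # key by the date string; a bare Date would index by its numeric value and pad the list with NULLs
   answer[[as.character(theDate)]] <- df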
      
   theDate <- theDate + 1
   Sys.sleep(1)
}
# combine the per-day rows and write the file once, after the loop
write.csv(bind_rows(answer), "immigrationStatistics.csv")

The final change was to add a short pause between requests so that the scraping does not look like an attack on the server.
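
On the error itself: one likely cause, offered as a reading of the question's code rather than something reproduced against the live site, is that the pipeline ends with `%>% write.csv(df, "immigrationStatistics.csv")`. The pipe already supplies the data as the first argument, so `df` is matched to `file`. Because `df` has not been assigned yet at that point, R finds the built-in function `stats::df` instead, and the `file == ""` check fails. The sketch below reproduces this offline with made-up data; `day` and `out.csv` are illustrative names, not taken from the question.

```r
library(dplyr)

# Toy row standing in for one day's scraped result (the values are made up).
day <- tibble(Control_Point = "Airport", arr_Total = 100L)

# `day %>% write.csv(df, "out.csv")` is evaluated as
# write.csv(day, df, "out.csv"). With no data frame called `df` defined yet,
# `df` resolves to the function stats::df and ends up as the `file` argument.
# suppressWarnings() hides the extra warning about the stray third argument.
err <- suppressWarnings(tryCatch(
  day %>% write.csv(df, "out.csv"),
  error = function(e) conditionMessage(e)
))

print(err)
```

Ending the pipeline at the last `mutate()` and calling `write.csv()` as a separate statement avoids this. The later `view(df)` error is consistent with the same cause, since `df` was never assigned and still refers to the function.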

Thanks for your help @Dave2e, and yes, a Sys.sleep is needed to avoid being mistaken for a DDoS! So my error was writing the file too early, so that only the last iteration was written. But I don't understand which line of code makes R remember all the scraped data before writing to the csv file. I expected some explicit syntax like 'append()', one row at a time. — ronzenith, Oct 01 '22 at 16:35
Before the loop starts, I am defining an empty list "answer" and then appending to it within the loop. After the loop, `bind_rows()` combines the individual rows together into the final answer. — Dave2e, Oct 01 '22 at 16:49
I realize that I am not in the habit of defining empty containers before a loop, such as empty lists, vectors, or arrays. May I know how you knew that such syntax was needed in the first place? @dave2e — ronzenith, Oct 01 '22 at 17:00
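
As a possible illustration of the pattern discussed in these comments: a variable assigned inside a loop is overwritten on every pass, so something created outside the loop has to hold the results. The sketch below runs offline and is not the answer's actual code; `scrape_one_day()` is a hypothetical stand-in for the scraping step and returns placeholder numbers. It also shows that `purrr::map()` can build the same list without an explicitly defined empty container.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for the scraping step: one row per date.
# The figures are placeholders, not real immigration statistics.
scrape_one_day <- function(the_date) {
  tibble(Control_Point = "Airport", arr_Total = 0L, date = the_date)
}

dates <- seq(as.Date("2022-09-01"), as.Date("2022-09-03"), by = "day")

# Pattern 1: create the container before the loop, fill one slot per pass.
answer <- vector("list", length(dates))
for (i in seq_along(dates)) {
  answer[[i]] <- scrape_one_day(dates[i])
}
looped <- bind_rows(answer)

# Pattern 2: map() returns the list directly, so no empty container is needed.
mapped <- bind_rows(map(seq_along(dates), function(i) scrape_one_day(dates[i])))

print(mapped)
```

Either way, `bind_rows()` is the step that stacks the per-day rows into one table, which can then be written to a file once.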
