I wrote some R code to scrape air quality data from the web. It used to run without any issues, but when I reran it recently I started getting an error at the html_nodes() call.

Here is my code:

library(rvest)
library(tidyverse)
library(lubridate)


## Download MOE Location Data
# https://stackoverflow.com/questions/25677035/how-to-create-a-range-of-dates-in-r

## Create a tibble of dates
start_date <- "2021/1/1"
end_date <- "2021/12/31"

dates <- seq(as.Date(start_date), as.Date(end_date), "days")

df <- NULL

for (datex in dates) {
  datef = as.Date(datex, origin = "1970-01-01")
  Day = day(datef)
  Month = month(datef)
  Year = year(datef)
  for (hour in 1:24) {
    url.new <-
      paste(
        "http://www.airqualityontario.com/aqhi/locations.php?start_day=",
        Day,
        "&start_month=",
        Month,
        "&start_year=",
        Year,
        "&my_hour=",
        hour,
        "&pol=36&text_only=1&Submit=Update",
        sep = ""
      )
    download.file(url.new, destfile = "scrapedpage.html", quiet=TRUE)
    simple <- read_html("scrapedpage.html")
    test <- simple %>%
      html_nodes("td") %>%
      html_text()
    test <- as_tibble(test)
    df.temp <-
      as.data.frame(matrix(
        unlist(test, use.names = FALSE),
        ncol = 3,
        byrow = TRUE
      )) %>%
      mutate(date = paste(datef)) %>%
      mutate(hour = hour)
    df <- rbind(df, df.temp)
    
  }
}


df <- as_tibble(df)

colnames(df) <- c("Station","Address","SurfaceConc","SurfaceDate","Hour")

MOE_data <- df %>%
  filter(Address != "Bay St. Wellesley St. W.") %>%
  select(-Address) %>%
  mutate(Station = trimws(Station)) %>%
  # filter(str_detect(Station, 'Toronto')) %>%
  mutate(Hour = paste(Hour, ":00:00", sep = "")) %>%
  mutate(Hour = hms::as_hms(Hour)) %>%
  mutate(SurfaceDate = paste(SurfaceDate, Hour)) %>%
  mutate(SurfaceDate = as_datetime(SurfaceDate)) %>%
  select(-Hour) 

MOE_data <- as_tibble(MOE_data)

# Keep only the final result; the object created above is named MOE_data
rm(list=setdiff(ls(), "MOE_data"))
# save.image(file='Jan2019_Dec2021.RData')

This is the error I get:

Error in UseMethod("xml_find_all") : 
  no applicable method for 'xml_find_all' applied to an object of class "xml_document"

What I don't understand is why it only happens for some values, some of the time. For example, I get the error when hour = 16, but when I rerun the script that same request may work. It's just not consistent.

  • It looks like you are downloading the HTML page and then reading it from disk. The download may sometimes take longer than expected, which could cause the error. Two things to try: add a short pause after `download.file()`, e.g. `Sys.sleep(0.5)`, or read the URL directly with `read_html(url.new)`. – Dave2e Jan 07 '22 at 18:32
  • I used the second suggestion and it worked. Thank you! – Priya Patel Jan 07 '22 at 23:12
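
For anyone landing here later, this is a minimal sketch of the second suggestion from the comments: reading the page directly with `read_html()` instead of going through `download.file()` and a temporary file. The helper names `build_url()`, `cells_to_df()` and `scrape_hour()` are made up for illustration and are not part of the original script. The sketch assumes, as the original code does, that the page's `td` cells come in groups of three.

```r
# Hypothetical helpers; names are illustrative, not from the original script.

# Build the request URL for one day and hour (same query string as the question).
build_url <- function(day, month, year, hour) {
  paste0(
    "http://www.airqualityontario.com/aqhi/locations.php?start_day=", day,
    "&start_month=", month,
    "&start_year=", year,
    "&my_hour=", hour,
    "&pol=36&text_only=1&Submit=Update"
  )
}

# Reshape the flat vector of <td> texts into rows of three columns
# and tag each row with the requested date and hour.
cells_to_df <- function(cells, date, hour) {
  out <- as.data.frame(matrix(cells, ncol = 3, byrow = TRUE))
  out$date <- rep(as.character(date), nrow(out))
  out$hour <- rep(hour, nrow(out))
  out
}

# Fetch and parse one hour of data straight from the URL (no temp file).
scrape_hour <- function(date, hour) {
  url <- build_url(
    as.integer(format(date, "%d")),
    as.integer(format(date, "%m")),
    as.integer(format(date, "%Y")),
    hour
  )
  page <- rvest::read_html(url)
  cells <- rvest::html_text(rvest::html_nodes(page, "td"))
  cells_to_df(cells, date, hour)
}
```

Inside the nested loop, the `download.file()` / `read_html("scrapedpage.html")` pair would then become something like `df <- rbind(df, scrape_hour(datef, hour))`. The other suggestion, keeping `download.file()` and adding `Sys.sleep(0.5)` after it, was not tried here.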
