Error when web scraping in R: Error in UseMethod("xml_find_all") :

Question

I wrote some code to scrape air quality data in R. It used to work without any issues, but when I reran it recently, I started getting an error from the html_nodes() function.

Here is my code:

library(rvest)
library(tidyverse)
library(lubridate)


## Download MOE Location Data
# https://stackoverflow.com/questions/25677035/how-to-create-a-range-of-dates-in-r

## Create a tibble of dates
start_date <- "2021/1/1"
end_date <- "2021/12/31"

dates <- seq(as.Date(start_date), as.Date(end_date), "days")

df <- NULL

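for (datex in dates) {
  # for() strips the Date class, so datex is a plain number of days since
  # 1970-01-01; convert it back to a Date before extracting day/month/year
  datef = as.Date(datex, origin = "1970-01-01")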
  Day = day(datef)
  Month = month(datef)
  Year = year(datef)
  for (hour in 1:24) {
    url.new <-
      paste(
        "http://www.airqualityontario.com/aqhi/locations.php?start_day=",
        Day,
        "&start_month=",
        Month,
        "&start_year=",
        Year,
        "&my_hour=",
        hour,
        "&pol=36&text_only=1&Submit=Update",
        sep = ""
      )
    download.file(url.new, destfile = "scrapedpage.html", quiet=TRUE)
    simple <- read_html("scrapedpage.html")
    test <- simple %>%
      html_nodes("td") %>%
      html_text()
    test <- as_tibble(test)
    df.temp <-
      as.data.frame(matrix(
        unlist(test, use.names = FALSE),
        ncol = 3,
        byrow = TRUE
      )) %>%
      mutate(date = paste(datef)) %>%
      mutate(hour = hour)
    df <- rbind(df, df.temp)
    
  }
}


df <- as_tibble(df)

colnames(df) <- c("Station","Address","SurfaceConc","SurfaceDate","Hour")

MOE_data <- df %>%
  filter(Address != "Bay St. Wellesley St. W.") %>%
  select(-Address) %>%
  mutate(Station = trimws(Station)) %>%
  # filter(str_detect(Station, 'Toronto')) %>%
  mutate(Hour = paste(Hour, ":00:00", sep = "")) %>%
  mutate(Hour = hms::as_hms(Hour)) %>%
  mutate(SurfaceDate = paste(SurfaceDate, Hour)) %>%
  mutate(SurfaceDate = as_datetime(SurfaceDate)) %>%
  select(-Hour) 

MOE_data <- as_tibble(MOE_data)

# Remove everything except the final result (the object is named MOE_data)
rm(list = setdiff(ls(), "MOE_data"))
# save.image(file='Jan2019_Dec2021.RData')

This is the error I get:

Error in UseMethod("xml_find_all") : 
  no applicable method for 'xml_find_all' applied to an object of class "xml_document"

What I don't understand is why it only happens for some values, and only some of the time. For example, I get the error when hour = 16, but when I rerun the code it may work. It's just not consistent.

It looks like you are downloading the HTML page and then reading it from disk. The download may be taking longer than expected, which could be generating the error. A couple of things to try: add a slight pause after the download.file call, e.g. `Sys.sleep(0.5)`, or read the URL directly with `read_html(url.new)`. - Dave2e, Jan 07 '22 at 18:32
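The two suggestions in this comment could be combined roughly as follows. This is only a sketch: it is not confirmed to fix the error, and the helper names `read_html_retry` and `extract_cells`, the `reader` argument, and the retry count of 3 are illustrative assumptions rather than anything from the original code. The page is read straight from the URL instead of through a temporary file, with a short pause before retrying when a read fails.

```r
library(rvest)

# Hypothetical helper (not from the original post): parse a page directly
# from its URL, pausing and retrying if the read throws an error.
read_html_retry <- function(url, max_tries = 3, pause = 0.5,
                            reader = rvest::read_html) {
  for (attempt in seq_len(max_tries)) {
    page <- tryCatch(reader(url), error = function(e) NULL)
    if (!is.null(page)) {
      return(page)
    }
    Sys.sleep(pause)  # brief pause before the next attempt
  }
  stop("Could not read ", url, " after ", max_tries, " attempts")
}

# Hypothetical helper: return the text of every <td> cell in a parsed page.
extract_cells <- function(page) {
  html_text(html_nodes(page, "td"))
}

# Inside the loop, the download.file()/read_html() pair would become:
#   simple <- read_html_retry(url.new)
#   test   <- extract_cells(simple)
```

Reading from the URL directly also avoids reusing the same "scrapedpage.html" file on every iteration, so a stale or partially written file cannot be parsed by mistake.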

0 Answers