I wrote some R code to scrape air quality data. It used to run without any issues, but when I reran it recently, I started getting an error from the html_nodes() function.
Here is my code:
library(rvest)
library(tidyverse)
library(lubridate)
## Download MOE Location Data
# https://stackoverflow.com/questions/25677035/how-to-create-a-range-of-dates-in-r
## Create a vector of dates
start_date <- "2021/1/1"
end_date <- "2021/12/31"
dates <- seq(as.Date(start_date), as.Date(end_date), "days")
df <- NULL
for (datex in dates) {
  datef <- as.Date(datex, origin = "1970-01-01")
  Day <- day(datef)
  Month <- month(datef)
  Year <- year(datef)
  for (hour in 1:24) {
    url.new <-
      paste(
        "http://www.airqualityontario.com/aqhi/locations.php?start_day=",
        Day,
        "&start_month=",
        Month,
        "&start_year=",
        Year,
        "&my_hour=",
        hour,
        "&pol=36&text_only=1&Submit=Update",
        sep = ""
      )
    download.file(url.new, destfile = "scrapedpage.html", quiet = TRUE)
    simple <- read_html("scrapedpage.html")
    test <- simple %>%
      html_nodes("td") %>%
      html_text()
    test <- as_tibble(test)
    df.temp <-
      as.data.frame(matrix(
        unlist(test, use.names = FALSE),
        ncol = 3,
        byrow = TRUE
      )) %>%
      mutate(date = paste(datef)) %>%
      mutate(hour = hour)
    df <- rbind(df, df.temp)
  }
}
df <- as_tibble(df)
colnames(df) <- c("Station","Address","SurfaceConc","SurfaceDate","Hour")
MOE_data <- df %>%
  filter(Address != "Bay St. Wellesley St. W.") %>%
  select(-Address) %>%
  mutate(Station = trimws(Station)) %>%
  # filter(str_detect(Station, 'Toronto')) %>%
  mutate(Hour = paste(Hour, ":00:00", sep = "")) %>%
  mutate(Hour = hms::as_hms(Hour)) %>%
  mutate(SurfaceDate = paste(SurfaceDate, Hour)) %>%
  mutate(SurfaceDate = as_datetime(SurfaceDate)) %>%
  select(-Hour)
MOE_data <- as_tibble(MOE_data)
# Keep only MOE_data in the workspace
rm(list = setdiff(ls(), "MOE_data"))
# save.image(file='Jan2019_Dec2021.RData')
This is the error I get:
Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "xml_document"
What I don't understand is why it only happens for some values, and only some of the time. For example, I get the error when hour = 16, but when I rerun the code it may work. The failure is not consistent.
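Since the failure is intermittent, a retry wrapper around the fetch-and-parse step may help show whether the problem is a transient bad download. The following is only a minimal sketch under that assumption, which has not been confirmed as the cause. `with_retry` is a hypothetical helper name and is not part of rvest or xml2.

```r
# Hypothetical helper (not from rvest/xml2): call f() up to `max_tries` times,
# pausing `wait` seconds between attempts, and return the first successful result.
with_retry <- function(f, max_tries = 3, wait = 1) {
  for (attempt in seq_len(max_tries)) {
    # Capture an error as a value instead of aborting the loop
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_tries) Sys.sleep(wait)
  }
  stop("All ", max_tries, " attempts failed: ", conditionMessage(result))
}

# Possible use inside the inner loop (assumes rvest is loaded and url.new
# has been built as in the code above):
# test <- with_retry(function() {
#   download.file(url.new, destfile = "scrapedpage.html", quiet = TRUE)
#   read_html("scrapedpage.html") %>%
#     html_nodes("td") %>%
#     html_text()
# })
```

The messages printed for each failed attempt show which date and hour combinations fail and whether a second attempt succeeds, which may help separate a server-side problem from a problem in the parsing code.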