I have a list of 38,000+ URLs. Each URL contains a table that I would like to scrape. For example:
library(rvest)
library(magrittr)
library(plyr)
#Doing URLs one by one
url <- "http://www.acpafl.org/ParcelResults.asp?Parcel=00024-006-000"
##GET SALES DATA
pricesdata <- read_html(url) %>% html_nodes(xpath = "//table[4]") %>% html_table(fill = TRUE)
df <- ldply(pricesdata, data.frame)   #plyr is already loaded above
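For one URL this works as expected. As a quick sanity check before generalizing (this is just an inspection step I run by hand; nothing later depends on it):

#Inspect the single-URL result
str(df)    #should be a single data frame built from //table[4] on the page
head(df)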
I would like to generalize this to all URLs in Parcels_ID.txt.
#(1) Step one is to generate a list of URLs that we want to scrape data from
parcels <- read.csv(file="Parcels_ID.txt", sep="\t", header=TRUE, stringsAsFactors=FALSE) #import the parcel data
parcelIDs <- as.vector(parcels$PARCELID) #pull the parcel IDs out as a plain vector
parcels$url <- paste0("http://www.acpafl.org/ParcelResults.asp?Parcel=", parcelIDs) #build each parcel's URL by appending its ID to the base address
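Before kicking off 38,000+ requests I spot-check a few of the constructed URLs (just a verification step I added; it assumes the PARCELID values are already in the format the site expects):

#Quick check that the constructed URLs look right
head(parcels$url)     #print the first few URLs
length(parcels$url)   #should equal the number of parcel IDs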
Now that I have the URLs, I would like to write a loop that visits each one, pulls the table, and stores the results in a list of data frames. This is where I am having trouble:
#(2) Step two is to write a for loop that will scrape the tables from the individual pages
compiled <- list()
for (i in seq_along(parcels$url)){
  ##GET SALES DATA
  pricesdata <- read_html(parcels$url[i]) %>% html_nodes(xpath = "//table[4]") %>% html_table(fill = TRUE)
  compiled[[i]] <- ldply(pricesdata, data.frame)
}
This code never completes. I would appreciate any eagle eyes that can spot errors or issues, or any suggestions on best practices for writing a for loop like this that builds a data frame from the tables pulled from each page.
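For reference, here is a variant of the loop I have been sketching. It is untested at full scale and the structure is just my guess at reasonable practice: tryCatch keeps one bad page from killing the whole run, Sys.sleep adds a small pause between requests, and a message prints progress every 100 URLs.

#(3) Sketch of a more defensive version of the loop -- untested on the full list
compiled <- vector("list", length(parcels$url))
for (i in seq_along(parcels$url)){
  compiled[[i]] <- tryCatch({
    pricesdata <- read_html(parcels$url[i]) %>%
      html_nodes(xpath = "//table[4]") %>%
      html_table(fill = TRUE)
    ldply(pricesdata, data.frame)
  }, error = function(e){
    message("Failed on URL ", i, ": ", conditionMessage(e))
    data.frame()    #empty placeholder so indices still line up with parcels$url
  })
  if (i %% 100 == 0) message("Done with ", i, " of ", length(parcels$url))
  Sys.sleep(0.5)    #be polite to the server; this alone adds several hours over 38,000 URLs
}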
Thank you