
I have a list of 38,000+ URLs. Each URL has a table that I would like to scrape. For example:

library(rvest)
library(magrittr)
library(plyr)


#Doing URLs one by one
url <- "http://www.acpafl.org/ParcelResults.asp?Parcel=00024-006-000"

##GET SALES DATA
pricesdata <- read_html(url) %>% html_nodes(xpath = "//table[4]") %>% html_table(fill=TRUE)
df <- ldply(pricesdata, data.frame)

I would like to generalize this to all the URLs in Parcels_ID.txt.

#(1) Step one is to generate a list of URLs that we want to scrape data from

parcels <- read.csv(file="Parcels_ID.txt", sep="\t", header=TRUE, stringsAsFactors=FALSE) #import the parcel data
parcelIDs <- as.vector(parcels$PARCELID) #reformat the column as a vector of parcel IDs
parcels$url <- paste("http://www.acpafl.org/ParcelResults.asp?Parcel=", parcelIDs, sep="") #paste the base web address and each parcel ID together to get each parcel's URL
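
A quick sanity check before looping, just to confirm the URLs were built as expected:

head(parcels$url)   #first few URLs should look like the single-URL example above
length(parcels$url) #should match the number of parcel IDs (38,000+)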

Now that I have that, I would like to write a loop that goes through each URL, pulls the table, and then puts the results in a list of data frames. This is where I am having trouble:

#(2) Step two is to write a for loop that will scrape the tables from the individual pages

compiled <- list()
for (i in seq_along(parcels$url)){

  ##GET SALES DATA
  pricesdata <- read_html(parcels$url[i]) %>% html_nodes(xpath = "//table[4]") %>% html_table(fill=TRUE)

  compiled[[i]] <- ldply(pricesdata, data.frame)

}

This code never completes. I would appreciate any eagle eyes that can spot errors or issues, or any suggestions on best practices for writing this for loop so that it builds a data frame from the tables pulled from the websites.
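
One variant I am considering (not yet tested on the full list) adds a progress message, a short pause between requests, and tryCatch around each page so a single failure does not stop the whole run; the one-second pause and the NULL placeholder for failed pages are just guesses at reasonable defaults:

compiled <- list()
for (i in seq_along(parcels$url)){
  message("Scraping ", i, " of ", length(parcels$url)) #progress so I can see it is still working

  ##GET SALES DATA, but keep going if one page fails (NULL marks a failed page)
  compiled[[i]] <- tryCatch({
    pricesdata <- read_html(parcels$url[i]) %>% html_nodes(xpath = "//table[4]") %>% html_table(fill=TRUE)
    ldply(pricesdata, data.frame)
  }, error = function(e) NULL)

  Sys.sleep(1) #pause between requests to be polite to the server
}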

Thank you

user3795577
  • Did you time how long it takes to process 1 page and then multiply that by 38,000? If each page took a second, you would expect to wait over 10 hours. – MrFlick Dec 04 '15 at 20:18
  • @MrFlick True... Do you think there is any way to speed this up? – user3795577 Dec 04 '15 at 20:26
  • (a) I'd ask the site owner for a content dump to be kind to their infrastructure. There's a good chance it's being generated from a database and it could save you a lot of headache. (b) If you're unwilling to let them know you're scraping their content by asking for a different format, then can you run GNU Parallel? If so, you can do something like http://stackoverflow.com/questions/8634109/parallel-download-using-curl-command-line-utility and then process the files on disk. – hrbrmstr Dec 04 '15 at 21:45
  • 1
    i just went to that one example URL. it's definitely a live database call and doing a mass scraping without a timeout/delay would be really terrible of you. It looks to be an IIS site with a SQL Server back end and from the looks of it, they are a small org, so this is probably a pretty small server. Please consider that when attempting an unintended denial of service attack on them. – hrbrmstr Dec 04 '15 at 21:54
  • @hrbrmstr I really appreciate the advice. Goes to show that a little bit of knowledge can be harmful. I am not intending to be harmful so it is good to know of the implications. Can you explain further what an unintended denial of service attack is? – user3795577 Dec 04 '15 at 22:54
  • @hrbrmstr Could you provide a little more detail on using GNU parallel for this along with that link? I've never used it before. – user3795577 Dec 04 '15 at 23:30
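
If the pages were first downloaded to disk with curl as in that linked answer, the parsing step in R could then run over the local files instead of hitting the live site; the "pages" directory and the .html file pattern below are made-up names for illustration:

library(rvest)
library(magrittr)
library(plyr)

files <- list.files("pages", pattern = "\\.html$", full.names = TRUE) #HTML files saved by the downloader
compiled <- lapply(files, function(f) {
  pricesdata <- read_html(f) %>% html_nodes(xpath = "//table[4]") %>% html_table(fill=TRUE)
  ldply(pricesdata, data.frame)
})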
