I am attempting to scrape values from a webpage using rvest in parallel with foreach and doParallel. Specifically, I am using a real estate property identifier called a TMK to retrieve the property's census tract number from the website.

In the sample code below, the foreach loop gives the desired results (a vector containing tract numbers) when run with %do%, but not with %dopar%.

library(rvest)
library(foreach)
library(doParallel)
registerDoParallel(cores = 4)

# sample input values used to generate html
tmklist <- c(91136088, 73006073, 92023027, 45061064)

# read html for each TMK
# DOES NOT PRODUCE DESIRED RESULT WHEN USED WITH %dopar%
tmkhtml <- foreach(i = seq_along(tmklist)) %do% {
  read_html(paste0(paste0('http://gis.hicentral.com/pubwebsite/TMKDetails.aspx?tmk=', tmklist[[i]]),'&lyrLst=0|0|0|0|0|0|0|0|0|0|0|0|0|13|0|15|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|lblsaerial2008&unit=0000&address='))
}

# scrape census tract number from html page
# loop returns vector of chr(0) if %dopar% used instead of %do%
tract.num <- list()

for (i in seq_along(tmkhtml)) {
  tract.num[[i]] <- html_text(html_nodes(tmkhtml[[i]], '#lblTrackNumber'))
}

I've (maybe incorrectly) deduced that the parallel backend is to blame, but I've used it many times before in other applications and can't seem to find the issue.

ndem763
  • This is a good question, but a word of warning: be careful not to get your IP banned by the sites you're scraping for making too many simultaneous requests. It happens a lot. – Hack-R Jul 08 '16 at 01:06

1 Answer

It might be that when you read each page using more than one instance of R, every worker needs the rvest package loaded in order to call rvest::read_html. Try loading the library inside the foreach loop, for example:

# read html for each TMK
tmkhtml <- foreach(i = seq_along(tmklist)) %dopar% {
  library(rvest)
  read_html(paste0(paste0('http://gis.hicentral.com/pubwebsite/TMKDetails.aspx?tmk=', tmklist[[i]]),'&lyrLst=0|0|0|0|0|0|0|0|0|0|0|0|0|13|0|15|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|lblsaerial2008&unit=0000&address='))
}
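
If the loop still returns empty results, another possible cause is that the documents returned by read_html hold external pointers to memory inside the worker process, and those pointers may not survive being sent back to the main session. A minimal sketch of a workaround, under that assumption, is to parse and extract inside the worker and return only plain character values. The inline HTML strings and their tract values below are hypothetical stand-ins so the sketch runs offline; in practice each element would be one of the TMKDetails.aspx URLs built from tmklist. The .packages argument of foreach is an alternative to calling library(rvest) in the loop body.

```r
library(foreach)
library(doParallel)

registerDoParallel(cores = 2)

# Hypothetical stand-in pages (not real data); replace with the real URLs.
pages <- c(
  '<html><body><span id="lblTrackNumber">101</span></body></html>',
  '<html><body><span id="lblTrackNumber">202</span></body></html>'
)

# Parse and extract inside each worker so that only a character value,
# not a parsed document, is sent back to the main session.
tract.num <- foreach(p = pages, .combine = c, .packages = "rvest") %dopar% {
  doc <- read_html(p)
  html_text(html_nodes(doc, "#lblTrackNumber"))
}

# Release the workers created implicitly by registerDoParallel().
stopImplicitCluster()
```

With .combine = c the result is a single character vector in input order, so the separate for loop over tmkhtml is no longer needed.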
Hanjo Odendaal