I am attempting to scrape values from a webpage using rvest in parallel with foreach and doParallel. Specifically, I am using a real estate property identifier called a TMK to retrieve the property's census tract number from the website.
In the sample code below, the foreach loop gives the desired results (a vector containing tract numbers) when run with %do%, but not with %dopar%.
require(rvest); require(foreach); require(doParallel)
registerDoParallel(cores = 4)
# sample input values used to generate html
tmklist <- c(91136088, 73006073, 92023027, 45061064)
# read html for each TMK
# DOES NOT PRODUCE DESIRED RESULT WHEN USED WITH %dopar%
tmkhtml <- foreach(i = seq_along(tmklist)) %do% {
  read_html(paste0('http://gis.hicentral.com/pubwebsite/TMKDetails.aspx?tmk=', tmklist[[i]], '&lyrLst=0|0|0|0|0|0|0|0|0|0|0|0|0|13|0|15|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|lblsaerial2008&unit=0000&address='))
}
# scrape census tract number from html page
# loop returns vector of chr(0) if %dopar% used instead of %do%
tract.num <- list()
for (i in seq_along(tmkhtml)) {
  tract.num[[i]] <- html_text(html_nodes(tmkhtml[[i]], '#lblTrackNumber'))
}
I've (maybe incorrectly) deduced that the parallel backend is to blame, but I've used it many times before in other applications and can't seem to find the issue.
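
One variant that may be worth testing, under the unconfirmed assumption that the parsed documents returned by read_html hold pointers that do not survive the trip back from the worker processes: do the extraction inside the parallel loop so that only plain character vectors are returned. The sketch below uses two made-up HTML strings (the pages vector and its tract values are hypothetical stand-ins, not real output from the site) so that it runs without network access; in the real case each element would be the full TMKDetails.aspx URL.

```r
library(rvest)
library(foreach)
library(doParallel)

# Hypothetical stand-in pages; replace with the real URLs built from tmklist.
pages <- c(
  '<html><body><span id="lblTrackNumber">86.10</span></body></html>',
  '<html><body><span id="lblTrackNumber">9.01</span></body></html>'
)

cl <- makeCluster(2)
registerDoParallel(cl)

# Parse and extract on the worker, so only character data is sent back
# to the main process instead of the parsed document object.
tract.num <- foreach(p = pages, .combine = c, .packages = "rvest") %dopar% {
  doc <- read_html(p)
  html_text(html_nodes(doc, "#lblTrackNumber"))
}

stopCluster(cl)
```

If this returns the tract numbers under %dopar% while the original two-step version does not, that would point at the returned document objects rather than at the backend registration itself.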