
I am trying to scrape a large number of web pages so that I can analyse them later. Since the number of URLs is huge, I decided to use the parallel package along with XML.

Specifically, I am using the htmlParse() function from XML, which works fine when used with sapply, but generates empty objects of class HTMLInternalDocument when used with parSapply.

library(parallel)
library(XML)

url1<- "http://forums.philosophyforums.com/threads/senses-of-truth-63636.html"
url2<- "http://forums.philosophyforums.com/threads/the-limits-of-my-language-impossibly-mean-the-limits-of-my-world-62183.html"
url3<- "http://forums.philosophyforums.com/threads/how-language-models-reality-63487.html"

myFunction<- function(x){
  cl<- makeCluster(getOption("cl.cores",detectCores()))
  ok<- parSapply(cl=cl,X=x,FUN=htmlParse)
  return(ok)
}

urls<- c(url1,url2,url3)

#Works
output1<- sapply(urls,function(x)htmlParse(x))
str(output1[[1]])
> Classes 'HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument', 'oldClass' <externalptr>
output1[[1]]
#prints the parsed HTML document


#Doesn't work
myFunction<- function(x){
  cl<- makeCluster(getOption("cl.cores",detectCores()))
  ok<- parSapply(cl=cl,X=x,FUN=htmlParse)
  stopCluster(cl)
  return(ok)
}

output2<- myFunction(urls)
str(output2[[1]])
> Classes 'HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument', 'oldClass' <externalptr>
output2[[1]]
#empty
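
For context on why the parallel version comes back empty: htmlParse() returns an external pointer to a C-level libxml2 document, and external pointers do not survive being serialised from the worker processes back to the master, so the returned documents print as empty. A minimal sketch of one workaround, assuming the extraction is done on the workers so that only ordinary character vectors travel back (myFunction2 and the //title XPath are purely illustrative):

library(parallel)
library(XML)

myFunction2 <- function(x){
  cl <- makeCluster(getOption("cl.cores", detectCores()))
  clusterEvalQ(cl, library(XML))            # each worker needs XML loaded
  ok <- parSapply(cl = cl, X = x, FUN = function(u){
    doc <- htmlParse(u)
    xpathSApply(doc, "//title", xmlValue)   # return text, not the document pointer
  })
  stopCluster(cl)
  return(ok)
}

With the toy URLs above, ok should come back as a named character vector of page titles rather than invalid document pointers.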

Thanks.

info_seekeR
  • Someone more knowledgeable will hopefully chime in, but my intuition is that parallelizing this (as currently designed) may not be very efficient, because you're fetching the websites directly in `htmlParse` and all of your cores likely share a single connection to the internet. You may want to look at `RCurl` for [asynchronous downloads, which are allegedly more efficient](http://www.inside-r.org/packages/cran/RCurl/docs/getURIAsynchronous). – Thomas Oct 20 '13 at 08:53
  • @Thomas Thanks. As in the previous questions where you've helped me out, I welcome your suggestions/comments. I shall look into RCurl too. – info_seekeR Oct 20 '13 at 09:05
  • Also note that if your individual web scrapes don't take that long (on the order of milliseconds), the overhead of parallelisation will make it take longer than simply processing in series. – Paul Hiemstra Oct 20 '13 at 09:38
  • @PaulHiemstra You are very correct. Thank you for the comment. In the original task, though, I am working with around 500 URLs, and I checked to make sure that parSapply was twice as fast as sapply. It's just that the results are strange, as shown in the toy example. – info_seekeR Oct 20 '13 at 10:04
  • I had the same problem! I might put a bounty on this to get it answered... – stanekam Dec 23 '13 at 22:30
  • It is up to you; I'd be thankful if putting a bounty on it gets an answer. However, I feel Thomas may be correct in suggesting the use of RCurl. The problem is still worth looking into! – info_seekeR Dec 24 '13 at 22:10
  • Did RCurl work parallelized? – stanekam Dec 30 '13 at 18:49
  • @iShouldUseAName I hadn't had the opportunity to try it, unfortunately, as I moved on to other parts of the project. But as suggested by @Thomas and shown by @agstudy, the way to go is `getURIAsynchronous`. @iShouldUseAName: thanks for adding the bounty! – info_seekeR Jan 01 '14 at 19:13

1 Answer


You can use getURIAsynchronous from the RCurl package, which allows the caller to specify multiple URIs to download at the same time.

library(RCurl)
library(XML)
get.asynch <- function(urls){
  txt <- getURIAsynchronous(urls)
  ## this part can easily be parallelized as well;
  ## I am just using lapply here as a first attempt
  res <- lapply(txt,function(x){
    doc <- htmlParse(x,asText=TRUE)
    xpathSApply(doc,"/html/body/h2[2]",xmlValue)
  })
  res
}

get.synch <- function(urls){
  lapply(urls,function(x){
    doc <- htmlParse(x)
    res2 <- xpathSApply(doc,"/html/body/h2[2]",xmlValue)
    res2
  })
}
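
As the comment in get.asynch notes, the parsing loop itself could be parallelised. Since each worker would return only the xpathSApply result (a plain character vector), the empty-external-pointer issue from the question does not arise. A rough sketch under that assumption, using a local cluster (get.asynch.par is just an illustrative name):

library(RCurl)
library(XML)
library(parallel)

get.asynch.par <- function(urls){
  txt <- getURIAsynchronous(urls)      # fetch all pages concurrently
  cl <- makeCluster(detectCores())
  clusterEvalQ(cl, library(XML))       # workers need XML for htmlParse/xpathSApply
  res <- parLapply(cl, txt, function(x){
    doc <- htmlParse(x, asText = TRUE)
    xpathSApply(doc, "/html/body/h2[2]", xmlValue)
  })
  stopCluster(cl)
  res
}

Whether this pays off depends on how heavy the parsing is relative to the cluster start-up overhead, as Paul Hiemstra's comment on the question points out.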

Here is some benchmarking for 100 URLs; the asynchronous version cuts the total parsing time roughly in half.

library(microbenchmark)
uris = c("http://www.omegahat.org/RCurl/index.html")
urls <- replicate(100,uris)
microbenchmark(get.asynch(urls),get.synch(urls),times=1)

Unit: seconds
             expr      min       lq   median       uq      max neval
 get.asynch(urls) 22.53783 22.53783 22.53783 22.53783 22.53783     1
  get.synch(urls) 39.50615 39.50615 39.50615 39.50615 39.50615     1
agstudy