
I am trying to scrape a large number of web pages so that I can analyse them later. Since the number of URLs is huge, I decided to use the parallel package along with XML.

Specifically, I am using the htmlParse() function from XML, which works fine when used with sapply, but generates empty objects of class HTMLInternalDocument when used with parSapply.

library(parallel)
library(XML)

url1<- "http://forums.philosophyforums.com/threads/senses-of-truth-63636.html"
url2<- "http://forums.philosophyforums.com/threads/the-limits-of-my-language-impossibly-mean-the-limits-of-my-world-62183.html"
url3<- "http://forums.philosophyforums.com/threads/how-language-models-reality-63487.html"

myFunction<- function(x){
  cl<- makeCluster(getOption("cl.cores",detectCores()))
  ok<- parSapply(cl=cl,X=x,FUN=htmlParse)
  return(ok)
}

urls<- c(url1,url2,url3)

#Works
output1<- sapply(urls,function(x)htmlParse(x))
str(output1[[1]])
> Classes 'HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument', 'oldClass' <externalptr>
output1[[1]]
#prints the parsed HTML document


#Doesn't work
myFunction<- function(x){
  cl<- makeCluster(getOption("cl.cores",detectCores()))
  ok<- parSapply(cl=cl,X=x,FUN=htmlParse)
  stopCluster(cl)
  return(ok)
}

output2<- myFunction(urls)
str(output2[[1]])
> Classes 'HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument', 'oldClass' <externalptr>
output2[[1]]
#empty
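
For context on why the parallel version comes back empty: htmlParse() returns an external pointer to a C-level libxml2 document, and external pointers do not survive being serialised from the worker processes back to the master, so the returned documents print as empty. A minimal sketch of one workaround, assuming the extraction is done on the workers so that only ordinary character vectors travel back (myFunction2 and the //title XPath are purely illustrative):

library(parallel)
library(XML)

myFunction2 <- function(x){
  cl <- makeCluster(getOption("cl.cores", detectCores()))
  clusterEvalQ(cl, library(XML))            # each worker needs XML loaded
  ok <- parSapply(cl = cl, X = x, FUN = function(u){
    doc <- htmlParse(u)
    xpathSApply(doc, "//title", xmlValue)   # return text, not the document pointer
  })
  stopCluster(cl)
  return(ok)
}

With the toy URLs above, ok should come back as a named character vector of page titles rather than invalid document pointers.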

Thanks.

info_seekeR
  • Someone more knowledgeable will hopefully chime in, but my intuition is that parallelizing this (as currently designed) may not be very efficient, because you're fetching the websites directly in `htmlParse` and all of your cores likely share a single connection to the internet. You may want to look at `RCurl` for [asynchronous downloads, which are allegedly more efficient](http://www.inside-r.org/packages/cran/RCurl/docs/getURIAsynchronous). – Thomas Oct 20 '13 at 08:53
  • @Thomas Thanks. As in the previous questions where you've helped me out, I welcome your suggestions/comments. I shall look into RCurl too. – info_seekeR Oct 20 '13 at 09:05
  • Also note that if your individual web scrapes don't take that long (on the order of milliseconds), the overhead of parallelisation will make it take longer than simply processing in series. – Paul Hiemstra Oct 20 '13 at 09:38
  • @PaulHiemstra You are very correct. Thank you for the comment. In the original task, though, I am working with around 500 URLs, and I checked to make sure that parSapply was twice as fast as sapply. It's just that the results are strange, as shown in the toy example. – info_seekeR Oct 20 '13 at 10:04
  • I had the same problem! I might put a bounty on this to get it answered... – stanekam Dec 23 '13 at 22:30
  • It is up to you; I'd be thankful if putting a bounty on it gets an answer. However, I feel Thomas may be correct in suggesting the use of RCurl. The problem is still worth looking into! – info_seekeR Dec 24 '13 at 22:10
  • Did RCurl work parallelized? – stanekam Dec 30 '13 at 18:49
  • @iShouldUseAName I hadn't had the opportunity to try it, unfortunately, as I moved on to other parts of the project. But as suggested by @Thomas and shown by @agstudy, the way to go is `getURIAsynchronous`. @iShouldUseAName: thanks for adding the bounty! – info_seekeR Jan 01 '14 at 19:13

1 Answer


You can use getURIAsynchronous from the RCurl package, which allows the caller to specify multiple URIs to download at the same time.

library(RCurl)
library(XML)
get.asynch <- function(urls){
  txt <- getURIAsynchronous(urls)
  ## this part can easily be parallelized as well;
  ## I am just using lapply here as a first attempt
  res <- lapply(txt,function(x){
    doc <- htmlParse(x,asText=TRUE)
    xpathSApply(doc,"/html/body/h2[2]",xmlValue)
  })
  res
}

get.synch <- function(urls){
  lapply(urls,function(x){
    doc <- htmlParse(x)
    res2 <- xpathSApply(doc,"/html/body/h2[2]",xmlValue)
    res2
  })
}
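
As the comment in get.asynch notes, the parsing loop itself could be parallelised. Since each worker would return only the xpathSApply result (a plain character vector), the empty-external-pointer issue from the question does not arise. A rough sketch under that assumption, using a local cluster (get.asynch.par is just an illustrative name):

library(RCurl)
library(XML)
library(parallel)

get.asynch.par <- function(urls){
  txt <- getURIAsynchronous(urls)      # fetch all pages concurrently
  cl <- makeCluster(detectCores())
  clusterEvalQ(cl, library(XML))       # workers need XML for htmlParse/xpathSApply
  res <- parLapply(cl, txt, function(x){
    doc <- htmlParse(x, asText = TRUE)
    xpathSApply(doc, "/html/body/h2[2]", xmlValue)
  })
  stopCluster(cl)
  res
}

Whether this pays off depends on how heavy the parsing is relative to the cluster start-up overhead, as Paul Hiemstra's comment on the question points out.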

Here is some benchmarking for 100 URLs; the asynchronous version cuts the total parsing time roughly in half.

library(microbenchmark)
uris = c("http://www.omegahat.org/RCurl/index.html")
urls <- replicate(100,uris)
microbenchmark(get.asynch(urls),get.synch(urls),times=1)

Unit: seconds
             expr      min       lq   median       uq      max neval
 get.asynch(urls) 22.53783 22.53783 22.53783 22.53783 22.53783     1
  get.synch(urls) 39.50615 39.50615 39.50615 39.50615 39.50615     1
agstudy