
I'm using `lapply` over a bunch of URLs to get some data out, and in this setup the `readLines(<URL>)` command works fine. When I switch to `sfLapply`, the code is unable to read the webpage. Does anyone know why? Example below:

library(snowfall)
library(rlecuyer)
sfInit(parallel = TRUE, cpus = as.integer(Sys.getenv('NUMBER_OF_PROCESSORS')) - 1)

# looping through each combination of fruit and dish
dtData = # replacing sfLapply below with lapply makes this run fine
sfLapply(
   c('apple','mango','banana'),
   function(fruit) {
      cat(fruit,'\n')
      lapply(
         c('pie','shake'),
         function(dish) {
            # getting the data 
            vcTemp = readLines(paste0('https://www.google.co.in/search?q=', dish, '+', fruit))
         }
      )
   }
)

sfStop()

The error message I get is:

Error in checkForRemoteErrors(val) : 3 nodes produced errors; first error: cannot open the connection
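In case it helps with debugging, here is a minimal sketch that wraps the call in `tryCatch` on the workers, so each worker returns the actual connection error text instead of `checkForRemoteErrors` aborting the whole run (the `safeReadLines` helper is my own wrapper, not part of snowfall). One thing worth checking on this setup (R 3.1.2 on Windows): base R before 3.2.0 could only open `https://` URLs after `setInternet2(TRUE)`, and a setting made in the master session does not automatically apply in the worker sessions snowfall spawns, which could explain why plain `lapply` works while `sfLapply` fails.

library(snowfall)

sfInit(parallel = TRUE, cpus = 2)

# hypothetical helper (mine, not part of snowfall): return the error text
# instead of letting checkForRemoteErrors() swallow it on the master
safeReadLines = function(url) {
   tryCatch(
      readLines(url),
      error = function(e) paste('ERROR:', conditionMessage(e))
   )
}
sfExport('safeReadLines')

# each list element is now either the page contents or the worker's error text
lsResult = sfLapply(
   c('apple', 'mango', 'banana'),
   function(fruit) safeReadLines(paste0('https://www.google.co.in/search?q=', fruit))
)
print(lsResult)

sfStop()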

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rpart_4.1-8      plyr_1.8.1       gridExtra_0.9.1  reshape2_1.4.1   clue_0.3-49      scales_0.2.4    
 [7] fpc_2.1-9        ggplot2_1.0.0    rlecuyer_0.3-3   snowfall_1.84-6  snow_0.3-13      data.table_1.9.4

loaded via a namespace (and not attached):
 [1] chron_2.3-45      class_7.3-12      cluster_2.0.1     colorspace_1.2-4  DEoptimR_1.0-2    digest_0.6.8     
 [7] diptest_0.75-6    flexmix_2.3-13    gtable_0.1.2      kernlab_0.9-20    labeling_0.3      lattice_0.20-30  
[13] MASS_7.3-39       mclust_4.4        modeltools_0.2-21 munsell_0.4.2     mvtnorm_1.0-2     nnet_7.3-9       
[19] prabclus_2.2-6    proto_0.3-10      Rcpp_0.11.4       robustbase_0.92-3 stats4_3.1.2      stringr_0.6.2    
[25] tools_3.1.2       trimcluster_0.1-2
TheComeOnMan
  • You need to use one of the following packages for web scraping (depending on your need): `rvest`, `httr`, `XML`, `xml2`, `RCurl`, `curl`, `RSelenium` (a sketch of this idea follows these comments). – user227710 Jul 01 '15 at 16:24
  • What's wrong with `readLines`? – TheComeOnMan Jul 01 '15 at 16:50
  • What's the exact error you are getting? And you say it works perfectly if you swap `sfLapply` with `lapply`? What OS and R version are you using? – MrFlick Jul 01 '15 at 16:55
  • @MrFlick - Yep, just run the above thing with the `lapply` instead of the `sfLapply` and it works fine. Added sessionInfo to the question. – TheComeOnMan Jul 01 '15 at 17:08
  • It's still unclear to me what exact error message you are getting. Can you paste that as well? I cannot replicate any error with this code. – MrFlick Jul 01 '15 at 17:26
  • @MrFlick - added the error message to the question. – TheComeOnMan Jul 02 '15 at 00:46
  • In the past, I ran into similar issues because some websites have a maximum number of simultaneous connections they will allow from the same IP address. Did you try running the code with only 2-4 processors? (A sketch testing this follows below.) – Lucas Fortini Oct 03 '15 at 16:40
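Following up on user227710's comment, one workaround worth trying is swapping base `readLines` for a scraping package and loading it on every worker with `sfLibrary`. A rough sketch using `RCurl` (untested against Google, which may block scripted requests):

library(snowfall)
library(RCurl)

sfInit(parallel = TRUE, cpus = 2)
sfLibrary(RCurl)   # load RCurl on every worker node, not just the master

lsPages = sfLapply(
   c('apple', 'mango', 'banana'),
   function(fruit) {
      # getURL() speaks https via libcurl, unlike base readLines() on R < 3.2.0
      getURL(paste0('https://www.google.co.in/search?q=', fruit))
   }
)

sfStop()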
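And a quick way to test Lucas Fortini's connection-limit theory, assuming `readLines` itself works serially on this machine: rerun with a small fixed worker count and a short random delay so the workers' requests are staggered.

library(snowfall)

sfInit(parallel = TRUE, cpus = 2)   # deliberately few workers

lsPages = sfLapply(
   c('apple', 'mango', 'banana'),
   function(fruit) {
      Sys.sleep(runif(1, 1, 3))   # stagger the workers' requests a little
      readLines(paste0('https://www.google.co.in/search?q=', fruit))
   }
)

sfStop()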

0 Answers