6

I am trying to scrape all bills from two pages on the website of the French lower chamber of parliament. The pages cover 2002-2012 and contain fewer than 1,000 bills each.

For this, I scrape them with getURL() in the following loop:

b <- "http://www.assemblee-nationale.fr" # base
l <- c("12","13") # legislature id

lapply(l, FUN = function(x) {
  print(data <- paste(b, x, "documents/index-dossier.asp", sep = "/"))

  # scrape
  data <- getURL(data); data <- readLines(tc <- textConnection(data)); close(tc)
  data <- unlist(str_extract_all(data, "dossiers/[[:alnum:]_-]+.asp"))
  data <- paste(b, x, data, sep = "/")
  data <- getURL(data)
  write.table(data,file=n <- paste("raw_an",x,".txt",sep="")); str(n)
})

Is there any way to optimise the getURL() calls here? I cannot seem to get concurrent downloading to work by passing the async=TRUE option; it gives me the same error every time:

Error in function (type, msg, asError = TRUE)  : 
Failed to connect to 0.0.0.12: No route to host

Any ideas? Thanks!

Fr.
  • `async=TRUE` is already the default if you give several URLs -- but opening more than 500 simultaneous connections to the same website may not be a good idea... – Vincent Zoonekynd Apr 09 '12 at 03:34
  • Alright. Well, I can't seem to change much about how `getURL()` works so far. – Fr. Apr 09 '12 at 11:23

2 Answers

1

Try mclapply {multicore} instead of lapply.

"mclapply is a parallelized version of lapply, it returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X." (http://www.rforge.net/doc/packages/multicore/mclapply.html)

If that doesn't work, you may get better performance using the XML package. Functions like xmlTreeParse use asynchronous calling.

"Note that xmlTreeParse does allow a hybrid style of processing that allows us to apply handlers to nodes in the tree as they are being converted to R objects. This is a style of event-driven or asynchronous calling." (http://www.inside-r.org/packages/cran/XML/docs/xmlEventParse)

rsoren
  • Nice! I did not know `mclapply` but that's a cool suggestion. Since I asked the question, I discovered your second option (using `XML` instead of `getURL`), and that works very well. In your opinion, would it be overkill to parallelize a loop where I parse HTML with `htmlParse`? – Fr. Mar 20 '14 at 09:29
  • That would probably scrape the data faster, but writing the code might be a net loss of time if your speed gains aren't very significant. It depends on the size of your data set. – rsoren Mar 26 '14 at 20:13
  • It's not big enough to justify that. Thanks :) – Fr. Mar 26 '14 at 22:05
-5

Why use R? For big scraping jobs you are better off using something already developed for the task. I've had good results with Down Them All, a browser add-on. Just tell it where to start, how deep to go, what patterns to follow, and where to dump the HTML.

Then use R to read the data from the HTML files.

The advantages are massive: these add-ons are developed especially for the task, so they will do multiple downloads (controllable by you), they will send the right headers so your next question won't be 'how do I set the user agent string with RCurl?', and they can cope with retrying when some of the downloads fail, which they inevitably do.

Of course the disadvantage is that you can't easily start this process automatically, in which case maybe you'd be better off with 'curl' on the command line, or some other command-line mirroring utility.
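
For instance, the download step could be handed to command-line curl from within R (a sketch; it assumes curl is installed and on the PATH, and the user-agent string and output file name are only examples):

# let command-line curl handle retries and headers, then read the saved file back into R
url <- "http://www.assemblee-nationale.fr/12/documents/index-dossier.asp"
system(paste("curl --retry 3 -A 'Mozilla/5.0' -o index-12.html", shQuote(url)))
page <- readLines("index-12.html", warn = FALSE)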

Honestly, you've got better things to do with your time than write website code in R...

Spacedman
  • I use R for the analysis that follows the data extraction! I'm making the operation entirely replicable, so a third app will not do. I'm open to suggestions with Python, for example, though. – Fr. Apr 09 '12 at 11:23
  • Why is Python okay but using 'curl' on the command line (possibly called from R via system) not? You are just going to try to duplicate the functionality of command-line curl via Python or R, and that's a big pointless effort. You can still use R, you just do it on the downloaded and saved files. Good luck with replicable work based on scraping from a web site... – Spacedman Apr 09 '12 at 11:41
  • Oh, `curl` would do. There's nice scraping code out there for Ruby and Python, and for bash, of course. Now, R is a practical way to share scraping code alongside the stats code, especially for users who do not use code on a regular basis. As for replicability, it's a parliamentary website, their archives tend to last. – Fr. Apr 09 '12 at 11:58