RCurl getURL with loop - link to a PDF kills looping

Question

I've been puzzling over this for a while and can't figure out how to get around it. It's easiest to show with working dummy code:

require(RCurl)
require(XML)

#set a bunch of options for curl
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
agent="Firefox/23.0" 
curl = getCurlHandle()
curlSetOpt(
  cookiejar = 'cookies.txt' ,
  useragent = agent,
  followlocation = TRUE ,
  autoreferer = TRUE ,
  httpauth = 1L, # "basic" http authorization version -- this seems to make a difference for India servers
  curl = curl
)


list1 <- c('http://timesofindia.indiatimes.com//articleshow/2933112.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933131.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933209.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933277.cms')

#note list2 has a new link inserted in 2nd position; this is the link that kills the following getURL calls
list2 <- c('http://timesofindia.indiatimes.com//articleshow/2933112.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933019.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933131.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933209.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933277.cms')



for ( i in seq( list1 ) ){
  print(list1[i])
  html <-
    try( getURL(
      list1[i],
      maxredirs = as.integer(20),
      followlocation = TRUE,
      curl = curl
    ),TRUE)
  if (inherits(html, "try-error")) {  # inherits() also works when the class vector has more than one element
    print(paste("error accessing",list1[i]))
    rm(html)
    gc()
    next
  } else {
    print('success')
  }
}


gc()

for ( i in seq( list2 ) ){
  print(list2[i])
  html <-
    try( getURL(
      list2[i],
      maxredirs = as.integer(20),
      followlocation = TRUE,
      curl = curl
    ),TRUE)
  if (inherits(html, "try-error")) {  # inherits() also works when the class vector has more than one element
    print(paste("error accessing",list2[i]))
    rm(html)
    gc()
    next
  } else {
    print('success')
  }
}

This should run with the RCurl and XML packages installed. The point is that when I insert http://timesofindia.indiatimes.com//articleshow/2933019.cms into the second position in the list, every subsequent request in the loop fails, even though the other links are unchanged. This happens consistently, here and in other cases, when the link leads to a PDF (open it to check).

Any thoughts on how to fix this so that a link to a PDF doesn't kill my loop? As you can see, I have tried removing the potentially offending object, calling gc() all over the place, and so on, but I can't figure out why a PDF kills my loop.

Just to check, here is my output for the two for loops:

    #[1] "http://timesofindia.indiatimes.com//articleshow/2933112.cms"
    #[1] "success"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933131.cms"
    #[1] "success"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933209.cms"
    #[1] "success"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933277.cms"
    #[1] "success"

and

    #[1] "http://timesofindia.indiatimes.com//articleshow/2933112.cms"
    #[1] "success"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933019.cms"
    #[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933019.cms"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933131.cms"
    #[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933131.cms"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933209.cms"
    #[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933209.cms"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933277.cms"
    #[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933277.cms"

score 0 · Answer 1 · answered Aug 24 '14 at 01:51

You might find it easier to use httr. It wraps RCurl and sets the options you need by default. Here's the equivalent code with httr:

require(httr)

urls <- c(
  'http://timesofindia.indiatimes.com//articleshow/2933112.cms',
  'http://timesofindia.indiatimes.com//articleshow/2933019.cms',
  'http://timesofindia.indiatimes.com//articleshow/2933131.cms',
  'http://timesofindia.indiatimes.com//articleshow/2933209.cms',
  'http://timesofindia.indiatimes.com//articleshow/2933277.cms'
)

responses <- lapply(urls, GET)
sapply(responses, http_status)

sapply(responses, function(x) headers(x)$`content-type`)
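
Building on the content-type check above, one way to skip PDFs is to test each response's content-type header before reading the body. This is a minimal sketch, not a tested fix for the original RCurl loop: is_pdf_type() and fetch_html() are hypothetical helper names, and it assumes a PDF is served with the application/pdf media type.

```r
# Hypothetical helper (not part of httr): TRUE when a Content-Type header
# value names a PDF. Case and parameters such as "; charset=..." are ignored,
# and a missing header counts as "not a PDF".
is_pdf_type <- function(content_type) {
  if (is.null(content_type) || length(content_type) != 1 || is.na(content_type)) {
    return(FALSE)
  }
  media_type <- trimws(strsplit(tolower(content_type), ";", fixed = TRUE)[[1]][1])
  identical(media_type, "application/pdf")
}

# Hypothetical helper: returns the body as text, or NULL when the request
# fails, the server answers with an HTTP error status, or the body is a PDF.
fetch_html <- function(url) {
  response <- tryCatch(httr::GET(url), error = function(e) NULL)
  if (is.null(response) || httr::status_code(response) >= 400) {
    return(NULL)
  }
  if (is_pdf_type(httr::headers(response)$`content-type`)) {
    return(NULL)
  }
  httr::content(response, as = "text")
}
```

With the urls vector above, pages <- lapply(urls, fetch_html) would return one element per URL, with NULL for anything that failed or was served as a PDF.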

answered Aug 24 '14 at 01:51

hadley


Thanks for this. httr is great to know about. Your answer also showed me that it is possible to determine the type of document behind a URL, and I am now working on using that to skip over PDFs with getURLContent(). – SOConnell Aug 24 '14 at 17:49
Follow-up posted here: http://stackoverflow.com/questions/25474682/rcurl-geturlcontent-detect-content-type-through-final-redirect – SOConnell Aug 24 '14 at 18:04
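
The getURLContent() idea from the comment could be sketched as follows. As far as the RCurl documentation describes it, getURLContent() returns a raw vector when the Content-Type indicates binary data, so is.character() can serve as a rough filter for text pages. The sketch also creates a new curl handle for every URL so that one failed request cannot affect the next. Whether the shared handle is what breaks the original loop is an assumption, not something established in this thread. fetch_body() and keep_text_bodies() are hypothetical helper names.

```r
# Hypothetical helper: fetch one URL with its own curl handle.
# Returns a character string for text, a raw vector for binary content,
# or NULL if the request raised an error.
fetch_body <- function(url, agent = "Firefox/23.0") {
  # New handle per request (assumption: this isolates failures between URLs).
  handle <- RCurl::getCurlHandle(
    useragent = agent,
    followlocation = TRUE,
    maxredirs = 20L
  )
  tryCatch(
    RCurl::getURLContent(url, curl = handle),
    error = function(e) NULL
  )
}

# Hypothetical helper: keep only bodies that came back as text,
# dropping raw (binary, e.g. PDF) results and failed requests.
keep_text_bodies <- function(bodies) {
  Filter(is.character, bodies)
}
```

For the question's list2, bodies <- lapply(list2, fetch_body) followed by keep_text_bodies(bodies) would leave only the pages returned as text.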
