Sorry about the title but I couldn't think how to phrase this one.

I am trying to scrape webpages for a study - they will be subjected to a battery of linguistic tests eventually.

In the meantime...

    require(RCurl)
    url1 <- "http://www.coindesk.com/bitinstants-charlie-shrem-sees-bitcoin-battles-ahead"  
    url2 <- "http://www.coindesk.com/terms-conditions/"

    html <- getURL(url1)   # read in page contents
    html
    [1] ""

    html <- getURL(url2)   # read in page contents
    html
    [1] "<!DOCTYPE html>\r\n<!--[if lt IE 7]> <html class=\"no-js ie ie6 oldie\" lang=\"en\"> <![endif]-->\r\n<!--[if IE 7]>    <html class=\"no-js ie ie7 oldie\" lang=\"en\"> <![endif]-->\r\n<!--[if IE 8]>......."

So given two URLs, each for a different page on the same website, the request for url1 returns an empty string, but url2 works just fine.

I've tried adding a browser user agent:

    html <- getURL(url1, .opts = list(useragent = "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13"))   # read in page contents

but that makes no difference, still an empty string.

I'm only on day two of learning R and now I AM STUMPED!

Can anyone suggest a reason why this is happening, or a solution?

Rich Scriven
BarneyC

2 Answers

To get this to work with RCurl, you need to use:

    getURL(url1, .opts = curlOptions(followlocation = TRUE))

I wish I could tell you why. When looking at the requests in Chrome I don't see any redirects, but maybe I'm missing something.
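One way to check for a redirect from R itself, rather than from the browser, is to request only the headers and tell curl not to follow redirects. This is a rough sketch, not a diagnosis: `parse_status` and `peek_headers` are made-up helper names, and `header`, `nobody` and `followlocation` are ordinary libcurl options that RCurl passes through. The server may also answer a header-only request differently from a normal one.

```r
# Pull the numeric status code out of the first line of a raw HTTP header block.
# (parse_status is an illustrative helper, not part of RCurl.)
parse_status <- function(raw_header) {
  first_line <- strsplit(raw_header, "\r?\n")[[1]][1]
  as.integer(sub("^HTTP/[0-9.]+ +([0-9]{3}).*$", "\\1", first_line))
}

# Fetch only the headers and do NOT follow redirects, so a 301/302 stays visible.
# (peek_headers is also illustrative; it needs the RCurl package and network access.)
peek_headers <- function(url) {
  RCurl::getURL(url, header = TRUE, nobody = TRUE, followlocation = FALSE)
}

# Usage (needs network access, so left commented out):
# hdr <- peek_headers(url1)
# parse_status(hdr)   # a 3xx value here would point to a redirect
# cat(hdr)            # look for a "Location:" line
```

If the status turns out to be in the 300 range, that would be consistent with `followlocation = TRUE` fixing the empty string.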

Note that you could also use the httr package:

    library(httr)
    GET(url1)
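Keep in mind that `GET` returns a response object rather than a character string like `getURL` does. A minimal sketch of getting the HTML text out of it, assuming the page is UTF-8 encoded; `fetch_html` is a hypothetical wrapper name, not something httr provides:

```r
# Fetch a page with httr and return its body as a single character string.
# (fetch_html is an illustrative wrapper; it needs the httr package and network access.)
fetch_html <- function(url) {
  resp <- httr::GET(url)
  httr::stop_for_status(resp)  # raise an R error on 4xx/5xx responses
  httr::content(resp, as = "text", encoding = "UTF-8")  # assumes UTF-8 content
}

# Usage (needs network access, so left commented out):
# html <- fetch_html(url1)
# nchar(html)
```

As far as I know httr follows redirects by default, which would explain why it works here without any extra options.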
MrFlick

I'm not exactly sure why getURL isn't working on that content, but htmlParse from the XML package seems to get the content okay.

Try this:

    library(XML)
    htmlParse(url1)
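`htmlParse` can also parse HTML that is already in memory, which is how it is usually paired with `getURL`. Here is a small sketch of that pattern; the inline HTML string is a made-up stand-in so the example runs without network access, and in practice it would be the string returned by something like `getURL(url1, followlocation = TRUE)`.

```r
library(XML)

# Made-up stand-in for a downloaded page, so the example runs offline.
html <- "<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"

# asText = TRUE tells htmlParse the argument is HTML content, not a file name or URL.
doc <- htmlParse(html, asText = TRUE)

# Extract the text of every <p> node as a character vector.
paragraphs <- xpathSApply(doc, "//p", xmlValue)
paragraphs
```

The resulting character vector is a reasonable starting point for any text analysis on the page content.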
Rich Scriven
  • Oh. Er. That's odd. So all the worked examples I've looked at so far have used `getURL`, followed by an `htmlParse`. Is there any real reason to use `getURL`, then? – BarneyC Aug 22 '14 at 18:09
  • Sometimes `htmlParse` can have trouble getting URL content, and in that case the advice is to use `getURL`. It is a bit strange, but there are questions about it on R-help. I'll see if I can find one for you – Rich Scriven Aug 22 '14 at 18:12