R getURL() returning empty string

Question

Sorry about the title but I couldn't think how to phrase this one.

I am trying to scrape webpages for a study - they will be subjected to a battery of linguistic tests eventually.

In the meantime...

    require(RCurl)
    url1 <- "http://www.coindesk.com/bitinstants-charlie-shrem-sees-bitcoin-battles-ahead"  
    url2 <- "http://www.coindesk.com/terms-conditions/"

    html <- getURL(url1)   # read in page contents
    html
    [1] ""

    html <- getURL(url2)   # read in page contents
    html
    [1] "<!DOCTYPE html>\r\n<!--[if lt IE 7]> <html class=\"no-js ie ie6 oldie\" lang=\"en\"> <![endif]-->\r\n<!--[if IE 7]>    <html class=\"no-js ie ie7 oldie\" lang=\"en\"> <![endif]-->\r\n<!--[if IE 8]>......."

So given two URLs for different pages on the same website, the request for url1 returns an empty string, but url2 works just fine.

I've tried adding a browser user agent, like this:

    html <- getURL(url1, .opts=list(useragent="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13"))   # read in page contents

but that makes no difference, still an empty string.

I'm only on day two of learning R and now I AM STUMPED!

Can anyone suggest a reason why this is happening, or a solution?

score 4 · Answer 1 · answered Aug 22 '14 at 20:03

To get this to work with RCurl, you need to use

    getURL(url1, .opts=curlOptions(followlocation = TRUE))

I wish I could tell you why. When looking at the requests in Chrome I don't see any redirects, but maybe I'm missing something.
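
One way to check whether a redirect is involved is to look at the HTTP status code of the plain request. The sketch below assumes the empty string comes from an unfollowed 3xx response, which is not confirmed here; `is_redirect` and `describe_fetch` are hypothetical helper names, and only `getCurlHandle`, `getURL` and `getCurlInfo` come from RCurl.

```r
# Sketch: inspect the status of a request made WITHOUT followlocation.
# is_redirect() and describe_fetch() are illustrative helpers, not RCurl functions.

# TRUE for any 3xx status code
is_redirect <- function(status) {
  status >= 300 & status < 400
}

describe_fetch <- function(url) {
  h <- RCurl::getCurlHandle()
  body <- RCurl::getURL(url, curl = h)   # default options, no followlocation
  info <- RCurl::getCurlInfo(h)          # details of the last transfer on this handle
  list(
    status   = info$response.code,
    redirect = is_redirect(info$response.code),
    empty    = !nzchar(body)
  )
}

# Usage (needs network access):
# describe_fetch(url1)
```

If `redirect` and `empty` both come back TRUE for url1, that would be consistent with `followlocation = TRUE` being the fix.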

Note that you could also use the httr package:

    library(httr)
    GET(url1)
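
Note that `GET()` returns a response object rather than a string, so there is one more step to get the page text. A minimal sketch, assuming you want the body as a single UTF-8 string; `fetch_html` is a hypothetical wrapper name, while `GET`, `stop_for_status` and `content` are httr functions. As far as I know httr follows redirects by default, but check that against your installed version.

```r
# Sketch: fetch a page with httr and return its body as text.
# fetch_html() is an illustrative wrapper, not part of httr.
fetch_html <- function(url) {
  resp <- httr::GET(url)
  httr::stop_for_status(resp)   # raise an R error on 4xx/5xx responses
  httr::content(resp, as = "text", encoding = "UTF-8")
}

# Usage (needs network access):
# html <- fetch_html(url1)
# nchar(html)
```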

Rich Scriven · Accepted Answer · 2014-08-22T18:06:33.077

0

I'm not exactly sure why getURL isn't working on that content, but htmlParse from the XML package seems to get the content okay.

Try this:

    library(XML)
    htmlParse(url1)

edited Aug 22 '14 at 18:06

answered Aug 22 '14 at 17:58

Rich Scriven


Oh. Er. That's odd. All the worked examples I've looked at so far have used `getURL` followed by `htmlParse`. Is there any real reason to use `getURL` then? – BarneyC Aug 22 '14 at 18:09
Sometimes `htmlParse` can have trouble getting URL content, so then they advise to use `getURL`. It is a bit strange, but there are questions about it in the R-help. I'll see if I can find one for you – Rich Scriven Aug 22 '14 at 18:12
