Apologies for the title; I couldn't think of a better way to phrase this one.
I'm trying to scrape web pages for a study; eventually they'll be subjected to a battery of linguistic tests.
In the meantime...
require(RCurl)
url1 <- "http://www.coindesk.com/bitinstants-charlie-shrem-sees-bitcoin-battles-ahead"
url2 <- "http://www.coindesk.com/terms-conditions/"
html <- getURL(url1) # read in page contents
html
[1] ""
html <- getURL(url2) # read in page contents
html
[1] "<!DOCTYPE html>\r\n<!--[if lt IE 7]> <html class=\"no-js ie ie6 oldie\" lang=\"en\"> <![endif]-->\r\n<!--[if IE 7]> <html class=\"no-js ie ie7 oldie\" lang=\"en\"> <![endif]-->\r\n<!--[if IE 8]>......."
So, given two URLs for different pages on the same website, the request for url1 returns an empty string, but url2 works just fine.
I've tried adding a browser user agent, as:
html <- getURL(url1, .opts=list(useragent="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13")) # read in page contents
but that makes no difference; I still get an empty string.
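One guess I'm planning to test next: maybe the server answers url1 with a redirect (e.g. to a trailing-slash version of the URL), and getURL() doesn't follow redirects by default, so I only get the redirect response's empty body. A sketch of what I'd try, assuming that's the cause:

```r
library(RCurl)

url1 <- "http://www.coindesk.com/bitinstants-charlie-shrem-sees-bitcoin-battles-ahead"

# followlocation = TRUE tells libcurl to follow 301/302 redirects
# instead of returning the (often empty) redirect response body.
html <- getURL(url1, followlocation = TRUE)
nchar(html)  # hopefully non-zero if a redirect was the problem
```

You can combine followlocation with the useragent option in the same call if the site also cares about the agent string.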
I'm only on day two of learning R and now I AM STUMPED!
Can anyone suggest a reason why this is happening, or a solution?