53

I am trying to parse this XML file, but am unable to. I checked the other solutions on the same topic, but I couldn't understand them. I am an R newbie.

> library(XML)
> fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
> doc <- xmlTreeParse(fileURL,useInternal=TRUE)

Error: XML content does not seem to be XML: 'https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml'

Can you please help?

Gregor Thomas
  • Paste the link into the Chrome address bar and you get the message "This XML file does not appear to have any style information associated with it." It then shows the document tree. – Marichyasana Nov 05 '14 at 23:41

5 Answers

52

Remove the s from https

library(XML)

fileURL<-"https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
doc <- xmlTreeParse(sub("s", "", fileURL), useInternal = TRUE)
class(doc)
## [1] "XMLInternalDocument" "XMLAbstractDocument"
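
The `sub("s", "", fileURL)` call above deletes the first "s" anywhere in the string, which happens to be the one in "https" for this URL. As a minimal sketch, assuming the server also answers over plain http, an anchored pattern makes the intent explicit; `to_http` is a hypothetical helper name, not part of the XML package.

```r
# Hypothetical helper (not from the answer): downgrade only the URL scheme.
to_http <- function(url) {
  # "^https://" is anchored to the start, so an "s" elsewhere is never touched.
  sub("^https://", "http://", url)
}

fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
print(to_http(fileURL))

# A URL that is already plain http comes back unchanged,
# whereas sub("s", "", ...) would delete the first "s" it finds.
print(to_http("http://example.com/restaurants.xml"))
```

The result of `to_http(fileURL)` can be passed to `xmlTreeParse` in place of `sub("s", "", fileURL)`.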
Rich Scriven
50

You can use RCurl to fetch the content; XML then seems to be able to handle it.

library(XML)
library(RCurl)
fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
xData <- getURL(fileURL)
doc <- xmlParse(xData)
jdharrison
  • Thanks @jdharrison for the reply. I got the following Error Message when I typed the fourth line: XData <- getURL(fileURL). **Error in function (type, msg, asError = TRUE): SSL certificate problem, verify that the CA cert is OK. Details: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed** What does it mean? –  May 12 '14 at 06:53
  • 4
    @ArpanGanguli Use `xData <- getURL(fileURL, ssl.verifypeer = FALSE)`. The error is explained in depth at http://www.omegahat.org/RCurl/FAQ.html – jdharrison May 12 '14 at 07:04
  • should that be omegahat.net ?? – Sean May 23 '16 at 08:37
  • 1
    @Sean yes it is now .net omegahat.net/RCurl/FAQ.html – jdharrison May 23 '16 at 20:39
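
A possible alternative to turning verification off with `ssl.verifypeer = FALSE` is to point libcurl at a CA bundle through the `cainfo` option. The sketch below assumes the RCurl installation ships the `CurlSSL/cacert.pem` file and that this bundle is recent enough to validate the server; `fetch_xml_verified` is a hypothetical helper name.

```r
library(RCurl)
library(XML)

# Hypothetical helper: fetch over https with certificate checks left on.
fetch_xml_verified <- function(url) {
  # CA bundle shipped with RCurl; system.file() returns "" if it is missing.
  ca_bundle <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
  if (!nzchar(ca_bundle)) {
    stop("No CA bundle found in this RCurl installation.")
  }
  xData <- getURL(url, cainfo = ca_bundle)
  # asText = TRUE: xData holds the XML itself, not a file name.
  xmlParse(xData, asText = TRUE)
}
```

Calling `fetch_xml_verified(fileURL)` with the URL from the answer should then return the parsed document if the certificate validates; if it does not, the same SSL error is raised.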
15

xmlTreeParse does not support https.

You can load the data with getURL (from RCurl) and then parse it.
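
A minimal sketch of the parse step, assuming the document has already been fetched into a character string. The string below is made-up stand-in data, not the real file, so that the example runs without a network connection; in real use it would be the value returned by `getURL`.

```r
library(XML)

# Stand-in for the string getURL() would return (made-up data).
xml_text <- paste0(
  "<response><row>",
  "<name>Example Cafe</name><zipcode>21231</zipcode>",
  "</row></response>"
)

# asText = TRUE tells xmlParse that the argument is XML content,
# not a file name or URL.
doc <- xmlParse(xml_text, asText = TRUE)
root <- xmlRoot(doc)

print(xmlName(root))
print(xpathSApply(doc, "//name", xmlValue))
```

Because the https request is handled by RCurl, the XML package only ever sees text and never has to open the URL itself.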

kaarefc
8

The answer is at http://www.omegahat.net/RCurl/installed/RCurl/html/getURL.html. The key point is to use ssl.verifyPeer=FALSE with getURL if a certificate error is shown.

library (RCurl)
library (XML)
curlVersion()$features
curlVersion()$protocol
## These should list ssl and https. They do on Windows 8.1 at least;
## it may differ on other OSes.

temp <- getURL("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml", ssl.verifyPeer=FALSE)
DFX <- xmlTreeParse(temp,useInternal = TRUE)

If ssl or https capability is not reported by the libcurl functions above, check how to use RCurl with HTTPS.

Atul Kumar
2

Using download.file avoids introducing another dependency. The following function returns the output of XML::xmlParse even when the URL starts with https. It caches the file in a temporary directory, so the file is downloaded only once no matter how many times the function is called during an R session.

xml_parse <- function(xml_url){
    # Temporary copy of the xml file, valid for this R session
    xml_temp_file <- file.path(tempdir(), basename(xml_url))
    if (!file.exists(xml_temp_file)){
        print(sprintf("Downloading to %s.", xml_temp_file))
        download.file(xml_url, xml_temp_file)
    }
    return(XML::xmlParse(xml_temp_file))
}

# Example
xml_content = xml_parse("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml")
Paul Rougieux