Reading XML using R - Error

Question

I'm trying to scrape XML from this URL: data.gov.in/sites/default/files/Potato_2013.xml using R (version 3.1.0).

I tried:

library(XML)
url <- "data.gov.in/sites/default/files/Potato_2013.xml"
doc <- xmlParse(url, useInternalNodes = TRUE)

But I am getting an error saying:

Error: The XML content does not seem to be xml:

Is there any way to fix this?

Typing just

doc<-xmlParse(url)

gives a "Document is empty" error.

I want to extract the values of nodes such as State, Commodity, Arrival Date, etc.

Thanks!

score 0 · Answer 1 · edited May 23 '17 at 12:11


This is taken from this SO question:

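library(XML)
library(RCurl)
# Download the raw XML as a character string first (slow for this large file)
url <- "data.gov.in/sites/default/files/Potato_2013.xml"
Data <- getURL(url)
# Parse the downloaded text rather than passing the URL to xmlParse()
doc <- xmlParse(Data)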
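
Once a document has been parsed, the node values the question asks about can be pulled out with XPath. The following is only a minimal sketch on a small inline sample rather than the real file: the element names `NewDataSet`, `Table`, `State`, `Commodity` and `Arrival_Date`, and the sample values, are assumptions for illustration and should be checked against the actual document.

```r
library(XML)

# Hypothetical sample standing in for the real file. The element names and
# values below are assumptions, not taken from Potato_2013.xml.
sample_xml <- '<?xml version="1.0"?>
<NewDataSet>
  <Table><State>Punjab</State><Commodity>Potato</Commodity><Arrival_Date>01/01/2013</Arrival_Date></Table>
  <Table><State>Kerala</State><Commodity>Potato</Commodity><Arrival_Date>02/01/2013</Arrival_Date></Table>
</NewDataSet>'

# asText = TRUE tells xmlParse() that the argument is XML content, not a file name or URL
sample_doc <- xmlParse(sample_xml, asText = TRUE)

# Select every record node, then flatten each record into one data frame row
records <- getNodeSet(sample_doc, "//Table")
df <- xmlToDataFrame(nodes = records, stringsAsFactors = FALSE)

# Alternatively, extract a single field as a character vector
states <- xpathSApply(sample_doc, "//Table/State", xmlValue)

print(df)
print(states)
```

On a file as large as the one in the question, extracting only the needed fields with `xpathSApply()` is likely to be cheaper than converting the whole tree, though that has not been measured here.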


answered Jul 13 '14 at 14:59

nrussell

@Spacedman yes good call; this is a very large file and taking a while to operate on. – nrussell Jul 13 '14 at 15:04
I think `xmlParse(url)` will work if the OP just sticks 'http://' on the URL, but I'm on a slow net at the moment... – Spacedman Jul 13 '14 at 15:05
Also, this is the worst kind of XML. It has no line breaks. It looks like a SOAP response. It looks like regular tabular data that would be about 1 MB as a CSV file. By splitting into lines on the tag, it's almost trivially readable. – Spacedman Jul 13 '14 at 15:39
@Spacedman You should post that as an answer. This file is unreasonably large - I had `xmlToList()` running on `doc` but after about 20 minutes I just killed the command because it was taking forever. – nrussell Jul 13 '14 at 15:44
Thanks a ton! It works. And yes, getURL does take a while to finish. – user3834109 Jul 13 '14 at 15:51
