I'm trying to scrape XML from this URL: data.gov.in/sites/default/files/Potato_2013.xml using R (version 3.1.0).

I tried:

library(XML)

url<- "data.gov.in/sites/default/files/Potato_2013.xml"

doc<- xmlParse(url,useInternalNodes=TRUE)

But I am getting an error saying:

Error: The XML content does not seem to be xml:

Is there any way to fix this?

Typing just

doc<-xmlParse(url)

gives a "document is empty" error.

I am looking to extract the values for the nodes State, Commodity, Arrival Date etc.

Thanks!

1 Answer

This is taken from this SO question:

library(XML)
library(RCurl)
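## Download the file as text first, then parse the text
url <- "data.gov.in/sites/default/files/Potato_2013.xml"
Data <- getURL(url)                    # whole file as one character string
doc <- xmlParse(Data, asText = TRUE)   # treat Data as XML content, not as a file name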
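
Once `doc` is parsed, the fields the question asks about can be pulled out with XPath. The following is only a minimal sketch: it assumes each record is a `<Table>` element with children named `State`, `Commodity` and `Arrival_Date`. Those element names and the sample values are guesses for illustration, not taken from the real file, so check them against the actual data before using this.

```r
library(XML)

# Tiny stand-in for the real file; element names and values are assumed.
sample_xml <- "<NewDataSet>
  <Table><State>StateA</State><Commodity>Potato</Commodity><Arrival_Date>01/01/2013</Arrival_Date></Table>
  <Table><State>StateB</State><Commodity>Potato</Commodity><Arrival_Date>02/01/2013</Arrival_Date></Table>
</NewDataSet>"

doc <- xmlParse(sample_xml, asText = TRUE)

# xpathSApply() applies xmlValue to every matching node and
# returns the results as a character vector.
state     <- xpathSApply(doc, "//Table/State", xmlValue)
commodity <- xpathSApply(doc, "//Table/Commodity", xmlValue)
arrival   <- xpathSApply(doc, "//Table/Arrival_Date", xmlValue)

potato <- data.frame(State = state, Commodity = commodity,
                     Arrival_Date = arrival, stringsAsFactors = FALSE)
```

If every record turns out to have the same flat set of children, `xmlToDataFrame(nodes = getNodeSet(doc, "//Table"))` may build the same table in one call.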
nrussell
  • @Spacedman yes good call; this is a very large file and taking a while to operate on. – nrussell Jul 13 '14 at 15:04
  • I think `xmlParse(url)` will work if the OP just sticks 'http://' on the URL, but I'm on a slow net at the moment... – Spacedman Jul 13 '14 at 15:05
  • Also, this is the worst kind of XML. It has no line breaks. It looks like a SOAP response. It looks like regular tabular data that would be about 1 MB as a CSV file. By splitting it into lines on the tag, it's almost trivially readable. – Spacedman Jul 13 '14 at 15:39
  • @Spacedman You should post that as an answer. This file is unreasonably large - I had `xmlToList()` running on `doc` but after about 20 minutes I just killed the command because it was taking forever. – nrussell Jul 13 '14 at 15:44
  • Thanks a ton! Works. And yes getURL does take a while to finish – user3834109 Jul 13 '14 at 15:51
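
The split-on-the-tag idea from the comments can be sketched in base R without building an XML tree at all, which may help when `xmlToList()` is too slow. This is a rough sketch under assumptions: the record tag `Table` and the field names `State` and `Commodity` are hypothetical, as are the sample values, and `get_field` is a helper made up for this example.

```r
# Stand-in for the downloaded text: one unbroken line, assumed tag names.
raw <- paste0("<NewDataSet>",
  "<Table><State>StateA</State><Commodity>Potato</Commodity></Table>",
  "<Table><State>StateB</State><Commodity>Potato</Commodity></Table>",
  "</NewDataSet>")

# Break the single long line into one chunk per record,
# then drop the trailing chunk that holds no record.
records <- strsplit(raw, "</Table>", fixed = TRUE)[[1]]
records <- records[grepl("<Table>", records, fixed = TRUE)]

# Hypothetical helper: return the text between <tag> and </tag> in each chunk.
get_field <- function(x, tag) {
  pattern <- paste0(".*<", tag, ">(.*?)</", tag, ">.*")
  sub(pattern, "\\1", x, perl = TRUE)
}

potato <- data.frame(State = get_field(records, "State"),
                     Commodity = get_field(records, "Commodity"),
                     stringsAsFactors = FALSE)
```

Note that `sub()` returns the chunk unchanged when a tag is missing, so records with absent fields would need an extra check. The regex approach also ignores escaping and attributes, so it only suits flat, regular data like the file described here.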