I am trying to build one large data frame from multiple XML pages.

I am able to create a data frame for a single page:

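library(RCurl)
library(XML)
library(plyr)

# Download the feed as a single character string
US_GrossiOS200 <- getURL("https://rss.itunes.apple.com/api/v1/us/ios-apps/top-grossing/all/200/explicit.rss")

# Parse the string, then flatten the resulting list into a data frame
USGr200.xml <- xmlTreeParse(US_GrossiOS200)
USGr200 <- ldply(xmlToList(USGr200.xml), data.frame)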

There are potentially hundreds of URLs that I want to scrape. To automate the process, I thought of creating a CSV file listing all of them. Here are the first two rows of listofurls.csv (two rows, one column):

1 https://rss.itunes.apple.com/api/v1/us/ios-apps/new-games-we-love/all/200/explicit.rss
2 https://rss.itunes.apple.com/api/v1/us/ios-apps/top-free/all/200/explicit.rss
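
For reference, here is a minimal sketch of how such a file can be read. It assumes a header row named URL (which is what CSV$URL implies), and the temporary file only stands in for listofurls.csv. The names csv_path and urls are placeholders.

```r
# Write a small stand-in for listofurls.csv (assumed layout: one column, header "URL")
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "URL",
  "https://rss.itunes.apple.com/api/v1/us/ios-apps/new-games-we-love/all/200/explicit.rss",
  "https://rss.itunes.apple.com/api/v1/us/ios-apps/top-free/all/200/explicit.rss"
), csv_path)

# stringsAsFactors = FALSE keeps the URLs as plain character strings
CSV <- read.csv(csv_path, stringsAsFactors = FALSE)

# A character vector with one URL per row
urls <- CSV$URL
```

Reading the column as character rather than factor matters on older R versions, where read.csv() converts strings to factors by default.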

At this stage, I am able to print the content of both pages to the console (I'm using RStudio) using getURL(CSV$URL), with CSV <- read.csv("listofurls.csv").

The str() of the getURL(CSV$URL) output reads as follows:

Named chr [1:2] "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<rss version=\"2.0\" xmlns:atom=\"http://www.w3.org/2005/Atom\">\n "| __truncated__ ...
- attr(*, "names")= chr [1:2] "https://rss.itunes.apple.com/api/v1/us/ios-apps/new-games-we-love/all/200/explicit.rss" "https://rss.itunes.apple.com/api/v1/us/ios-apps/top-free/all/200/explicit.rss"  ...

I then try to pass this output to xmlTreeParse(), but I get the following error:

XML declaration allowed only at the start of the document

Extra content at the end of the document

Error: 1: XML declaration allowed only at the start of the document

2: Extra content at the end of the document
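
My guess, and it is only a guess, is that the parser receives both documents as a single string, which would explain a second XML declaration showing up mid-document. Below is a minimal sketch of parsing each element separately instead. The two small documents are made up and stand in for the real feeds, and the names pages, parsed and titles are placeholders.

```r
library(XML)

# Two tiny stand-in documents; in the real case these would be the
# elements of the character vector returned by getURL(CSV$URL)
pages <- c(
  first  = '<?xml version="1.0" encoding="UTF-8"?>\n<rss version="2.0"><channel><title>A</title></channel></rss>',
  second = '<?xml version="1.0" encoding="UTF-8"?>\n<rss version="2.0"><channel><title>B</title></channel></rss>'
)

# Parse one document at a time rather than handing over the whole vector
parsed <- lapply(pages, function(txt) xmlTreeParse(txt, asText = TRUE))

# Each list element is now its own parsed document; pull one value from each
titles <- vapply(
  parsed,
  function(doc) xmlValue(xmlRoot(doc)[["channel"]][["title"]]),
  character(1)
)
```

If this is the right direction, each element of parsed could presumably go through the same ldply(xmlToList(...), data.frame) step as the single-page case, with the results combined afterwards (for example with plyr's rbind.fill()), but I have not verified that on the real feeds.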

Suggestions?

euclideans