
I'm trying to get a list of all the XML documents in a web server directory (and all its subdirectories).

I've tried these examples:

One:

library(XML)  
url <- "https://cmgds.marine.usgs.gov/metadata/pcmsc/"
getHTMLLinks(url)

Returns character(0) with the warning: "XML content does not seem to be XML".

Two:

readHTMLTable(url)

Returns the same error.

I've tried other sites as well, including the ones used in the package examples. I saw some SO questions (example) about this error that suggest changing https to http. When I do that, I get Error: failed to load external entity.

Is there a way I can get a list of all the XML files at that URL and all the subdirectories using R?


1 Answer

To get the raw HTML from the page:

require(rvest)

url <- "https://cmgds.marine.usgs.gov/metadata/pcmsc/"

html <- read_html(url)

Then we'll get all the links with html_nodes. The file names displayed in the listing are truncated, so we pull each link's href attribute instead of reading the table with html_table().

data <- html %>% html_nodes("a") %>% html_attr('href')
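
For the specific goal in the question (XML files only, including subdirectories), here is a minimal sketch of one way to extend this; it is not part of the original answer. It assumes the server returns a plain Apache-style directory listing in which subdirectories appear as relative links ending in "/", and the function name list_xml and the filtering rules are my own.

require(rvest)

list_xml <- function(url) {
  # All link targets on this directory page
  links <- read_html(url) %>% html_nodes("a") %>% html_attr("href")
  links <- links[!is.na(links)]

  # Full URLs of the XML files in this directory
  xml_files <- paste0(url, links[grepl("\\.xml$", links, ignore.case = TRUE)])

  # Subdirectories: relative links ending in "/" (skip parent, absolute, and sort links)
  subdirs <- links[grepl("/$", links) & !grepl("^(/|\\.\\.|http|\\?)", links)]

  # Recurse into each subdirectory and combine everything into one vector
  c(xml_files, unlist(lapply(paste0(url, subdirs), list_xml)))
}

xml_urls <- list_xml("https://cmgds.marine.usgs.gov/metadata/pcmsc/")

This does loop through every subdirectory it finds, so it can take a while on a large tree, and it only works as long as the server exposes a directory listing.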
  • Surrounding your code with some explanation would improve your answer. – zx485 Jan 25 '18 at 08:28
  • This seems to truncate file names that are too long, i.e. the list that's returned is truncated as it appears in a web browser. Do you know if there's a way around that? Also, is it possible to get into the subdirectories, other than looping through them all with extra code? – Evan Jan 25 '18 at 17:32
  • @Bird See my update. For the sub-directories I'm not sure. – Mako212 Jan 25 '18 at 21:49