
I'm trying to get a list of all the XML documents in a web server directory (and all its subdirectories).

I've tried these examples:

One:

library(XML)  
url <- "https://cmgds.marine.usgs.gov/metadata/pcmsc/"
getHTMLLinks(url)

Returns character(0) with the warning: "XML content does not seem to be XML".

Two:

readHTMLTable(url)

Returns the same error.

I've tried other sites as well, including the ones used in the package examples. I saw some SO questions (example) about this error that suggest changing https to http. When I do that, I get Error: failed to load external entity.

Is there a way I can get a list of all the XML files at that URL and all the subdirectories using R?


1 Answer

To get the raw HTML from the page:

require(rvest)

url <- "https://cmgds.marine.usgs.gov/metadata/pcmsc/"

html <- read_html(url)

Then we'll get all the links with html_nodes. The file names displayed in the listing are truncated, so we pull each link's href attribute instead of reading the table with html_table().

data <- html %>% html_nodes("a") %>% html_attr('href')
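
For the specific goal in the question (XML files only, including subdirectories), here is a minimal sketch of one way to extend this; it is not part of the original answer. It assumes the server returns a plain Apache-style directory listing in which subdirectories appear as relative links ending in "/", and the function name list_xml and the filtering rules are my own.

require(rvest)

list_xml <- function(url) {
  # All link targets on this directory page
  links <- read_html(url) %>% html_nodes("a") %>% html_attr("href")
  links <- links[!is.na(links)]

  # Full URLs of the XML files in this directory
  xml_files <- paste0(url, links[grepl("\\.xml$", links, ignore.case = TRUE)])

  # Subdirectories: relative links ending in "/" (skip parent, absolute, and sort links)
  subdirs <- links[grepl("/$", links) & !grepl("^(/|\\.\\.|http|\\?)", links)]

  # Recurse into each subdirectory and combine everything into one vector
  c(xml_files, unlist(lapply(paste0(url, subdirs), list_xml)))
}

xml_urls <- list_xml("https://cmgds.marine.usgs.gov/metadata/pcmsc/")

This does loop through every subdirectory it finds, so it can take a while on a large tree, and it only works as long as the server exposes a directory listing.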
  • Surrounding your code with some explanation would improve your answer. – zx485 Jan 25 '18 at 08:28
  • This seems to truncate file names that are too long, i.e. the list that's returned is truncated as it appears in a web browser. Do you know if there's a way around that? Also, is it possible to get into the subdirectories, other than looping through them all with extra code? – Evan Jan 25 '18 at 17:32
  • @Bird See my update. For the sub-directories I'm not sure. – Mako212 Jan 25 '18 at 21:49