I was reading an RCurl document and came across a new piece of code:
stockReader =
function()
{
    values <- numeric()  # vector to which the data is appended as it is received

    # Function that appends the parsed values to the centrally stored vector
    read = function(chunk) {
        con = textConnection(chunk)
        on.exit(close(con))
        tmp = scan(con)
        values <<- c(values, tmp)
    }

    list(read = read,
         values = function() values)  # accessor to get the result on completion
}
followed by
reader = stockReader()
getURL("http://www.omegahat.org/RCurl/stockExample.dat",
       write = reader$read)
reader$values()
It says 'numeric' in the sample, but surely this code sample can be adapted to other kinds of data? Read the attached document; I'm sure you will find what you're looking for.
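In case it helps, here is a minimal sketch of how the same closure pattern might be adapted to line-oriented text rather than numbers. The lineReader name and the use of readLines() in place of scan() are my own assumptions, not from the document, and a line split across chunk boundaries would need extra handling:

library(RCurl)

lineReader =
function()
{
    lines <- character()  # accumulates lines of text as chunks arrive

    # Parse each chunk into lines and append them to the stored vector
    read = function(chunk) {
        con = textConnection(chunk)
        on.exit(close(con))
        lines <<- c(lines, readLines(con))
    }

    list(read = read,
         lines = function() lines)  # accessor to get the result on completion
}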
It also says
The basic use of getURL(), getForm() and postForm() returns the contents of the requested document as a single block of text. It is accumulated by the libcurl facilities and
combined into a single string. We then typically traverse the contents of the document to
extract the information into regular data, e.g. vectors and data frames. For example, suppose
the document we requested is a simple stream of numbers such as prices of a particular stock
at different time points. We would download the contents of the file, and then read it into
a vector in R so that we could analyze the values. Unfortunately, this results in essentially
two copies of the data residing in memory simultaneously. This can be prohibitive or at least
undesirable for large datasets.
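Concretely, the whole-document approach described there might look something like the following sketch (using the stock example URL from above; note that the raw text and the parsed vector are both in memory at once):

library(RCurl)

# Fetch the entire document as a single string, then parse it into a numeric vector
txt = getURL("http://www.omegahat.org/RCurl/stockExample.dat")
con = textConnection(txt)
prices = scan(con)
close(con)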
An alternative approach is to process the data in chunks as it is received by libcurl. If we can
be notified each time libcurl receives data from the reply and do something meaningful with
the data, then we need not accumulate the chunks. The largest extra piece of information we
will need to hold in memory at any one time is a single chunk. In our example, we could take each chunk and pass it
to the scan() function to turn the values into a vector. Then we can concatenate this with
the vector from the previously processed chunks.
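As a quick check of that logic (no network needed), one can feed the reader a couple of simulated chunks by hand and confirm that the values accumulate; the chunk strings below are made up for illustration:

# Simulate libcurl delivering the reply in two chunks
reader = stockReader()
reader$read("1.5 2.25 3\n")
reader$read("4 5.75\n")
reader$values()   # c(1.5, 2.25, 3.0, 4.0, 5.75)

# Note: a number split across a chunk boundary would be parsed incorrectly
# by this simple reader; handling that case would require buffering the tail
# of each chunk.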