
I have a problem with an FTP server that slows dramatically after returning a few files.

I am trying to access data from a government server at the National Snow and Ice Data Center, using an R script and the RCurl library, which is a wrapper for libcurl. The line of code I am using is this (as an example for a directory listing):

getURL(url="ftp://n5eil01u.ecs.nsidc.org/SAN/MOST/MOD10A2.005/")

or this example, to download a particular file:

getBinaryURL(url="ftp://n5eil01u.ecs.nsidc.org/SAN/MOST/MOD10A2.005/2013.07.28/MOD10A2.A2013209.h26v04.005.2013218193414.hdf")

I have to make the getURL() and getBinaryURL() requests frequently because I am picking through directories looking for particular files and processing them as I go.
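
Schematically, the loop looks something like this (a simplified sketch, not my actual code: the directory, the filename filter, and the dirlistonly/ftp.use.epsv curl options are illustrative placeholders):

library(RCurl)

dir.url <- "ftp://n5eil01u.ecs.nsidc.org/SAN/MOST/MOD10A2.005/2013.07.28/"

# list the directory (names only, one per line)
listing <- getURL(url=dir.url, dirlistonly=TRUE, ftp.use.epsv=FALSE)
files <- strsplit(listing, "\r?\n")[[1]]

# keep only the tiles of interest, then fetch and process each one
wanted <- grep("h26v04.*\\.hdf$", files, value=TRUE)
for(f in wanted){
  bin <- getBinaryURL(url=paste0(dir.url, f))
  writeBin(bin, f)
  # ...process the downloaded file before moving on to the next request...
}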

In each case, the server very quickly returns the first 5 or 6 files (which are ~1 MB each), but then my script often has to wait 10 minutes or more before the next files become available; in the meantime the server doesn't respond. If I restart the script, or try curl from the OS X Terminal, I again get a very quick response for the first few files, followed by the same massive slowdown.

I am fairly sure the server's behavior has something to do with preventing DoS attacks or limiting the bandwidth used by bots or careless users. However, I am new to this and don't understand how to work around the slowdown. I've asked the people who maintain the server, but I don't have a definitive answer yet.

Questions:

Assuming for a moment that this problem is not unique to the particular server, would my goal generally be to keep the same session open, or to start new sessions with each FTP request? Would the server be using a cookie to identify my session? If so, would I want to erase or modify the cookie? I don't understand the role of handles, either.

I apologize for the vagueness but I'm wandering in the wilderness here. I would appreciate any guidance, even if it's just to existing resources.

Thanks!

John
    The server is throttling the connection. You can contact them and ask if you can be made an exception. The only other way to "circumvent" this that I know of would involve illegal hacking attempts -- as they clearly want users throttled. – Matt Runion Jun 08 '16 at 21:37
  • This question might help: http://stackoverflow.com/questions/6412212/libcurl-keep-connection-open-to-upload-multiple-files-ftp – effel Jun 09 '16 at 00:05
  • Try `ftp_han <- curl::new_handle() curl_download("ftp://n5eil01u.ecs.nsidc.org/SAN/MOST/MOD10A2.005/2013.07.28/MOD10A2.A2013209.h26v04.005.2013218193414.hdf", "MOD10A2.A2013209.h26v04.005.2013218193414.hdf", handle=ftp_han)` (re-using an existing handle) but it does indeed look like @mrunion is right abt the throttling. – hrbrmstr Jun 09 '16 at 01:29
  • Thanks very much for your suggestions. I'm in conversation with the server people now and will update the question when I have a solution. – John Jun 09 '16 at 06:39
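
For reference, hrbrmstr's comment above written out as runnable code (this uses the curl package rather than RCurl; the destination filename here is simply the basename of the remote file):

library(curl)

ftp_han <- new_handle()  # one handle, reused across requests
url <- "ftp://n5eil01u.ecs.nsidc.org/SAN/MOST/MOD10A2.005/2013.07.28/MOD10A2.A2013209.h26v04.005.2013218193414.hdf"
curl_download(url, destfile = basename(url), handle = ftp_han)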

1 Answer


The solution was to release the curl handle after each FTP request. That didn't work at first, however, because R was holding onto the handle even after it had been removed with rm(). The fix (suggested by Bill Dunlap on the R-help list) was to force garbage collection with gc() after removing the handle. In summary, the successful code looked like this:

library(RCurl)

for(file in filelist){
  curl <- getCurlHandle()          # create a fresh curl handle for this request
  getURL(url=file, curl=curl, ...) # download the file (other curl options omitted here)
  rm(curl)                         # drop the handle so the connection can be released
  gc()                             # the magic call to garbage collection, without which the above does not work
}

I still suspect that there may be a more elegant way to accomplish the same thing using the RCurl library, but at least this works.
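
If there is such a way, one candidate (untested, and assuming RCurl exposes libcurl's CURLOPT_FORBID_REUSE option under the name forbid.reuse) would be to keep a single handle but tell libcurl to close the FTP connection after every transfer:

library(RCurl)

# one handle for the whole loop, but each transfer's connection is closed when it finishes
curl <- getCurlHandle(forbid.reuse = TRUE)
for(file in filelist){
  getURL(url=file, curl=curl)
}

Whether that sidesteps the slowdown the same way releasing the handle does would need to be tested against this particular server.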

John