I have a lot of files I need to download.

I am using the download.file() function and furrr::map to download the files in parallel, with plan(strategy = "multicore").

Please advise: how can I load more jobs onto each future?

Running on Ubuntu 18.04 with 8 cores. R version 3.5.3.

The files can be txt, zip, or any other format. Sizes vary from 5 MB to 40 MB each.

SteveS

1 Answer


Using furrr works just fine. I think what you mean is furrr::future_map. Using multicore substantially increases the download speed. (Note: on Windows, multicore is not available, only multisession. Use multiprocess if you are unsure which platform your code will run on.)
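
If you prefer to pick the strategy explicitly instead of relying on multiprocess, here is a minimal sketch (it uses future::supportsMulticore(), which reports whether forked processing is available on the current platform):

library(future)

# Use forked processes where supported, fall back to separate
# R sessions otherwise (e.g. on Windows).
if (supportsMulticore()) {
    plan(multicore)
} else {
    plan(multisession)
}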

library(furrr)
#> Loading required package: future

csv_file <- "https://raw.githubusercontent.com/UofTCoders/rcourse/master/data/iris.csv"

# Download one copy of the file to a uniquely named temporary file.
download_template <- function(.x) {
    temp_file <- tempfile(pattern = paste0("dl-", .x, "-"), fileext = ".csv")
    download.file(url = csv_file, destfile = temp_file)
}

# Sequential baseline: download the file five times in a loop.
download_normal <- function() {
    for (i in 1:5) {
        download_template(i)
    }
}

# Parallel downloads with forked processes (Unix-alikes only).
download_future_core <- function() {
    plan(multicore)
    future_map(1:5, download_template)
}

# Parallel downloads with separate background R sessions.
download_future_session <- function() {
    plan(multisession)
    future_map(1:5, download_template)
}

library(microbenchmark)

microbenchmark(
    download_normal(),
    download_future_core(),
    download_future_session(),
    times = 3
)
#> Unit: milliseconds
#>                       expr       min        lq      mean    median
#>          download_normal()  931.2587  935.0187  937.2114  938.7787
#>     download_future_core()  433.0860  435.1674  488.5806  437.2489
#>  download_future_session() 1894.1569 1903.4256 1919.1105 1912.6942
#>         uq       max neval
#>   940.1877  941.5968     3
#>   516.3279  595.4069     3
#>  1931.5873 1950.4803     3

Created on 2019-03-25 by the reprex package (v0.2.1)

Keep in mind that I am using Ubuntu, so results on Windows will likely differ, since as far as I understand future does not allow multicore on Windows.

I am just guessing here, but the reason multisession is slower could be that it has to open up several R sessions before running download.file(). I was downloading a very small dataset (iris.csv), so on larger datasets that take more time to download, the time taken to open the R sessions would be offset by the download time itself.
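
One way to test that guess is to pay the session start-up cost before timing anything. A rough sketch (assuming that one trivial task per worker is enough to launch all of the background sessions):

library(future)
library(furrr)

plan(multisession)
# Warm-up: one trivial task per worker, so the background R sessions
# are already running before the downloads are timed.
invisible(future_map(seq_len(nbrOfWorkers()), ~ NULL))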

Minor update:

You can pass a vector of URLs to the datasets into future_map, so that each file is downloaded as scheduled by the future package. Note that download.file() also needs a destfile for each URL, so map over a vector of destination paths as well (the dest_files names below are just placeholders):

data_urls <- c("https:.../data.csv", "https:.../data2.csv")
dest_files <- c("data.csv", "data2.csv")
library(furrr)
plan(multiprocess)
# download.file() requires a destination, so map over both vectors:
future_map2(data_urls, dest_files, download.file)
# Or use walk if you only want the side effect:
# future_walk2(data_urls, dest_files, download.file)
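
If some of the URLs might fail, here is a sketch of a more defensive variant (assuming you would rather collect the failures than abort the whole map) that wraps download.file() in purrr::safely():

library(furrr)
library(purrr)

plan(multiprocess)
# safely() converts errors into return values, so one bad URL
# does not stop the remaining downloads.
safe_download <- safely(download.file)
results <- future_map2(data_urls, dest_files, safe_download)

# Which downloads failed?
errors <- map(results, "error")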
Luke W. Johnston
  • How can I ask each worker to download, say, 5 files at once? I have 8 cores and I am downloading 8 files each time. I want to download 40 each time. @luke-w-johnston – SteveS Mar 25 '19 at 12:46
  • 1
    @SteveS not sure I understand. Can you add some code in your question that shows what you've done so far? The code I've put here should be mostly in the form you need, but I don't know exactly unless you provide a [MWE](https://stackoverflow.com/help/mcve) – Luke W. Johnston Mar 25 '19 at 12:52
  • @luke-w-johnston, I mean tweak future_options to make it download more files per job sent to a worker. Actually it's the same code as yours. – SteveS Mar 25 '19 at 13:24
  • 1
    Not really sure that's now parallel processing works, at least with the futures package. Why not do something like `future_map(vector_of_urls, download_file_function)`. That will process the exact number of jobs as it runs. You can't really control how to download `n` number of jobs *inside* the parallel processing function. – Luke W. Johnston Mar 25 '19 at 13:32
  • 2
    See the docs for future, multiprocess is basically multisession on Windows and multicore on Unix-alikes. Including multiprocess in the comparison is thus redundant. – Hong Ooi Mar 25 '19 at 14:15
  • @HongOoi Updated the answer to reflect your comment. – Luke W. Johnston Mar 26 '19 at 11:53