I need to download a ton of images in one month.

I've written a script that downloads small JSON responses at about 200 per second on my personal machine; eventually I will run it on a server. (I know that image downloads will unfortunately be much slower.) The script, shown below, makes asynchronous calls in parallel, which is about three times as fast as making the calls asynchronously but serially.

require(crul)
require(tidyverse)
require(tictoc)
require(furrr)

asyncCalls <- function(i) {
    urls_to_call <- all_urls[i:min(i + 99, length(all_urls))]  # batch of up to 100 URLs
    cc <- Async$new(urls = urls_to_call)  # ready the requests
    res <- cc$get()  # make the requests
    lapply(res, function(z) z$parse("utf-8"))  # parse the crul results
}

all_urls <- paste0("http://placehold.it/640x440&text=image", seq(1, 200))

plan(multiprocess)  # use multiple cores
tic()
metadata <- unlist(future_map(seq(0, floor(length(all_urls)/100)) * 100, ~ asyncCalls(.x)))
toc()

As one would expect, running these image URLs through asyncCalls() returns all elements as NA.

How do I modify the script to allow me to quickly download the images from those URLs? I can't find a file download function in crul, and I'm not sure how to asynchronously use something like download.file(). Thanks!

Prayag Gordy

1 Answer


crul maintainer here.

Async supports writing to disk. You need to pass in a vector of file paths the same length as the vector of URLs. For example:

library(crul)
cc <- Async$new(
  urls = c(
    'https://eu.httpbin.org/get?a=5',
    'https://eu.httpbin.org/get?foo=bar',
    'https://eu.httpbin.org/get?b=4',
    'https://eu.httpbin.org/get?stuff=things',
    'https://eu.httpbin.org/get?b=4&g=7&u=9&z=1'
  )
)
files <- replicate(5, tempfile())  # one output path per URL
res <- cc$get(disk = files)        # each response body is written to its file
out <- lapply(files, readLines)    # read the downloaded text back in

For your use case the responses are image files rather than text, but the same logic applies.
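
Applied to the script in the question, a minimal sketch might look like the following. The batch size, the placehold.it URLs, the image file names, and the use of tempdir() are carried over or assumed from the question rather than required by crul, and the future_map() parallelization from the question could wrap downloadBatch() the same way it wrapped asyncCalls():

library(crul)

all_urls <- paste0("http://placehold.it/640x440&text=image", seq(1, 200))

downloadBatch <- function(i) {
    idx <- i:min(i + 99, length(all_urls))                       # batch of up to 100 URLs
    paths <- file.path(tempdir(), paste0("image", idx, ".png"))  # one file path per URL (assumed naming)
    cc <- Async$new(urls = all_urls[idx])
    cc$get(disk = paths)  # responses are streamed straight to the files
    paths                 # return the file paths instead of parsed text
}

image_files <- unlist(lapply(seq(1, length(all_urls), by = 100), downloadBatch))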

sckott