Reading the drake documentation, I found no way to define the order of targets other than 'file_in' and 'file_out':

file_in() marks individual files (and whole directories) that your targets depend on.

file_out() marks individual files (and whole directories) that your targets create.

However, neither of them can be used with dynamic targets.

So how can I define an order that dynamic targets should follow? I also tried make(plan, targets = c("ftp_list", "download.dbc", "dbc_list", "generate_parquet")), but it didn't work.

In the code below, for example, I have four targets. What I'd like (order):

  1. Get the FTP list from the server
  2. Download the first file from the FTP list (there is not enough disk space to download them all)
  3. Get the downloaded file
  4. Convert it to .parquet (then start over: download the second file, convert it to .parquet, and so on)

Any idea how I can link dynamic targets without using file_in and file_out (not allowed in this case)? Thanks!

Example code:

URL <- "ftp://ftp.url"
LOCAL_PATH <- getwd()

plan <- drake_plan(

  ftp_list = obtain_filenames_from_url(url_ = URL, 
                                       remove_extension_from_filename_ = FALSE,
                                       full_names = TRUE)[seq_len(10)],

  download.dbc = target(download_dbc(ftp_list, 
                                local_path = paste0(LOCAL_PATH, "/")), 
                   dynamic = map(ftp_list)),

  dbc_list = target(list.files(LOCAL_PATH, full.names = TRUE, 
                               pattern = "\\.dbc$")),

  generate_parquet = target(convert_dbc(dbc_list, delete_dbc_after_conversion = TRUE),  
                            dynamic = map(dbc_list))
)

Plan graph output: [image of the plan's dependency graph, not available]

1 Answer

Target order

file_in() and file_out() are only necessary when you actually need to work with files, directories, or URLs. drake targets are R objects, and target order is determined by how targets are mentioned in commands. drake reads your commands and functions with static code analysis to resolve target order. In the plan below, targets a, b, and c are in an arbitrary order, but drake runs them in the correct order because of how the symbols are mentioned.

library(drake)

plan <- drake_plan(
  c = head(b),
  a = mtcars[, seq_len(3)],
  b = tail(a)
)

plot(plan)


make(plan)
#> target a
#> target b
#> target c

readd(c) # Targets are R objects
#>                 mpg cyl  disp
#> Porsche 914-2  26.0   4 120.3
#> Lotus Europa   30.4   4  95.1
#> Ford Pantera L 15.8   8 351.0
#> Ferrari Dino   19.7   6 145.0
#> Maserati Bora  15.0   8 301.0
#> Volvo 142E     21.4   4 121.0

Created on 2020-02-07 by the reprex package (v0.3.0)
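
To see which dependencies drake's static code analysis picks up from a command, deps_code() can be called on the command directly. A minimal sketch, reusing the command of target c from the plan above:

```r
library(drake)

# List the symbols that drake detects in the command of target `c`.
# `b` should be among them, which is why `b` is built before `c`.
deps <- deps_code(quote(head(b)))
print(deps)
```

The result is a data frame with one row per detected dependency, so it can be inspected whenever a target does not run in the order you expect.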

Your plan

Here are some things that could help your current plan.

  1. Use file_in() on ftp://ftp.url to detect when ftp_list should update.
  2. Define a function (say, get_dbc_data_frame()) to download some files (part of the ftp_list) and read them into memory.
  3. Skip converting to Parquet. Instead, return data frames as the sub-targets' values. Then, with format = "fst", drake will store those data frames in fst files.

Sketch:

get_dbc_data_frame <- function(ftp_list_entry, local_path) {
  # 1. Download the files from the ftp_list_entry into local_path.
  # 2. Read them into memory.
  # 3. Return a data frame.
}

plan <- drake_plan(
  ftp_list = obtain_filenames_from_url(
    url_ = file_in("ftp://ftp.url"), 
    remove_extension_from_filename_ = FALSE,
    full_names = TRUE
  )[seq_len(10)],
  dbc_data = target(
    get_dbc_data_frame(ftp_list, local_path = paste0(getwd(), "/")),
    format = "fst", # Tell drake to store the data frame as an fst file.
    dynamic = map(ftp_list)
  )
)
landau
  • It is very good to understand how drake defines the order of targets. Thanks! However, I cannot skip the Parquet conversion. There are many other steps (which I haven't added, e.g. putting the Parquet files on S3 [AWS] and creating an Athena database [AWS]) that require files in this format. So, do you have any tips on how to make sure 'download.dbc' runs before 'generate_parquet'? Thanks a lot! – Marcos Freitas Feb 08 '20 at 20:56
  • Unfortunately, `file_out()` does not allow you to define dynamic files, so you would not be able to write dynamic file_out()s from `dbc_data` sub-targets. One option is to just write the files in the sub-targets without tracking them with `file_out()`. Slightly less reproducible, but not terrible if you are careful. Another option is to define another target downstream of the existing `dbc_data` above which takes all the sub-targets and writes them to parquet all in one target. – landau Feb 08 '20 at 22:22
  • Just out of curiosity: are there any plans to support dynamic files in file_out()? – Marcos Freitas Feb 18 '20 at 18:05
  • Not for `file_out()` specifically, but I'm going to give it another try using a different mechanism: https://github.com/ropensci/drake/issues/1168. The tricky part about `file_out()` is that it needs to be static so `drake` can figure out the dependency graph. With dynamic files as targets, the dependency graph does not need to change. – landau Feb 18 '20 at 18:08
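
The first option from the comments (writing the files inside the sub-targets without tracking them with file_out()) could be sketched roughly as follows. This is only an illustration under assumptions: the bodies of download_dbc() and convert_dbc() below are hypothetical stand-ins for the question's helpers (they write placeholder files instead of touching the network or producing real Parquet), download_dbc() is assumed to return the local path of the file it downloaded, and download_and_convert() is a made-up wrapper name.

```r
library(drake)

# Hypothetical stand-in for the question's download_dbc(). Assumption: it
# returns the local path of the downloaded file. A real version would
# download the file here instead of writing a placeholder.
download_dbc <- function(url, local_path) {
  dest <- file.path(local_path, basename(url))
  writeLines("placeholder for .dbc contents", dest)
  dest
}

# Hypothetical stand-in for the question's convert_dbc(). It copies the
# placeholder to a .parquet path (not real Parquet) and can delete the
# .dbc file afterwards to free disk space.
convert_dbc <- function(dbc_path, delete_dbc_after_conversion = TRUE) {
  parquet_path <- sub("\\.dbc$", ".parquet", dbc_path)
  file.copy(dbc_path, parquet_path, overwrite = TRUE)
  if (delete_dbc_after_conversion) file.remove(dbc_path)
  parquet_path
}

# Hypothetical wrapper: download one file, convert it, and return the
# path of the converted file as the sub-target's value.
download_and_convert <- function(url, local_path) {
  dbc_path <- download_dbc(url, local_path)
  convert_dbc(dbc_path, delete_dbc_after_conversion = TRUE)
}

LOCAL_PATH <- tempdir() # assumption: any writable directory works

plan <- drake_plan(
  # Placeholder URLs; the real plan would call obtain_filenames_from_url().
  ftp_list = c("ftp://ftp.url/a.dbc", "ftp://ftp.url/b.dbc"),
  parquet_file = target(
    download_and_convert(ftp_list, local_path = LOCAL_PATH),
    dynamic = map(ftp_list) # one sub-target per entry of ftp_list
  )
)

make(plan)
```

Because each sub-target both downloads and converts a single file, the download always happens before the conversion, and each .dbc file is deleted as soon as it has been converted. The trade-off is the one mentioned in the comment: the files are not tracked with file_out(), so drake will not notice if they are changed or deleted later.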