R {drake} plan: Read many datasets into single target

Question

I have started using {drake} for a data production pipeline. The raw data I work with is quite large and is split into ~130 separate (Stata) files, so each file should be processed separately. To keep the plan readable, I use target() with transform = map() to specify it. The plan looks similar to the code below:

plan <- drake_plan(
    dta_paths = list.files(my_folder, full.names = TRUE),
    dfs = target(
        read.dta13(dta_path),
        transform = map(dta_path = dta_paths)
    )
)

When I make() the plan, I get the following warnings and error:

target dfs_dta_paths

Warning: target dfs_dta_paths warnings:

the condition has length > 1 and only the first element will be used

the condition has length > 1 and only the first element will be used

the condition has length > 1 and only the first element will be used

fail dfs_dta_paths

Error: Target dfs_dta_paths failed. Call diagnose(dfs_dta_paths) for details. Error message:

Expecting a single string value: [type=character; extent=129].

From what I understand of the warnings and the error message, the mapping over the file paths is not working, and the full vector is passed to a single function call. I read https://books.ropensci.org/drake/static.html#map but it did not help me figure out the problem. Converting the vector of paths to a list did not help either.

From "How to combine multiple drake targets into a single cross target without combining the datasets?" I got the idea of predefining a grid, which works as suggested. But since I only need a vector, not a complex grid, this looks like over-engineering to me.

I feel like I'm missing something obvious, but I can't spot it. Any ideas what's wrong with my code?

I am aware of https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets, but since I want to iterate in the process of data cleaning, I thought it would be helpful to create the dfs target as shown above.

Update: you may be interested in dynamic files: https://github.com/ropensci/drake/pull/1178. Brand new in development `drake` (the GitHub version, `remotes::install_github("ropensci/drake")`). — landau, Feb 22 '20 at 13:31

score 2 · Accepted Answer · answered Jan 16 '20 at 20:57

When you use target(transform = ...), it is always best to visualize the plan before you feed it to make(). It can take a couple of iterations to get it right. Here is what your current plan looks like.

library(drake)
plan <- drake_plan(
  dta_paths = list.files(my_folder, full.names = TRUE),
  dfs = target(
    read.dta13(dta_path),
    transform = map(dta_path = dta_paths)
  )
)

plan
#> # A tibble: 2 x 2
#>   target        command                                 
#>   <chr>         <expr>                                  
#> 1 dta_paths     list.files(my_folder, full.names = TRUE)
#> 2 dfs_dta_paths read.dta13(dta_paths)

config <- drake_config(plan)
vis_drake_graph(config)

Created on 2020-01-16 by the reprex package (v0.3.0)
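
As a rough complement to looking at the printed plan, the mismatch can also be caught with a quick count before calling make(). This is only a sketch: the temporary folder, the empty placeholder files, and the names n_dfs and n_files are made up for illustration, and it assumes the targets are named as in the printed plan above.

```r
library(drake)

# Hypothetical stand-in for the real data folder: two empty placeholder files.
my_folder <- tempfile("dta_")
dir.create(my_folder)
file.create(file.path(my_folder, c("file1.dta", "file2.dta")))

# The plan from the question, unchanged. Nothing is read here;
# drake_plan() only builds the data frame of targets and commands.
plan <- drake_plan(
  dta_paths = list.files(my_folder, full.names = TRUE),
  dfs = target(
    read.dta13(dta_path),
    transform = map(dta_path = dta_paths)
  )
)

# Compare the number of dfs_* targets with the number of files on disk.
n_dfs <- sum(grepl("^dfs_", plan$target))
n_files <- length(list.files(my_folder))

# If these differ, map() did not expand to one target per file.
n_dfs == n_files
```

If the two counts differ, the plan is not yet doing what was intended, and make() would only repeat the error from the question.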

To read one file per target, I recommend the plan below. See https://books.ropensci.org/drake/static.html#tidy-evaluation for more on why it uses !!.

library(drake)

# create some faux stata files for the example.
my_folder <- fs::dir_create("folder")
file.create("folder/file1.dta")
#> [1] TRUE
file.create("folder/file2.dta")
#> [1] TRUE

# Since you are using static branching (https://books.ropensci.org/drake/static.html)
# this needs to be defined up front.
# It does not need to be a target, re https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets
dta_paths <- list.files(my_folder, full.names = TRUE)

plan <- drake_plan(
  dfs = target(
    # Use !! here to literally insert the path so file_in() can mark it for tracking.
    read.dta13(file_in(!!dta_path)),
    # Use !! here to insert the actual vector of paths instead of the symbol `dta_paths`
    transform = map(dta_path = !!dta_paths)
  )
)

plan
#> # A tibble: 2 x 2
#>   target               command                                
#>   <chr>                <expr>                                 
#> 1 dfs_folder.file1.dta read.dta13(file_in("folder/file1.dta"))
#> 2 dfs_folder.file2.dta read.dta13(file_in("folder/file2.dta"))

config <- drake_config(plan)
vis_drake_graph(config)

Created on 2020-01-16 by the reprex package (v0.3.0)
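
What !! changes can be seen in isolation with rlang's expr(). This is a minimal sketch, on the assumption that drake_plan() handles !! in the same way; here map is just a name inside a captured expression and is never called, and with_symbol and with_value are illustrative names.

```r
library(rlang)

dta_paths <- c("folder/file1.dta", "folder/file2.dta")

# Without !!, the argument is captured as the bare symbol `dta_paths`.
with_symbol <- expr(map(dta_path = dta_paths))

# With !!, the current value of `dta_paths` is inserted into the expression.
with_value <- expr(map(dta_path = !!dta_paths))

is.symbol(with_symbol[["dta_path"]])    # TRUE
is.character(with_value[["dta_path"]])  # TRUE
```

In the first case the transform only ever sees one name, which matches the single dfs_dta_paths target in the original plan. In the second it sees the paths themselves, so there is one value per file to map over.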

Something you might not have realized: `target(transform = ...)` tries to create multiple literal static targets in the plan, i.e. more literal rows in the data frame. `drake` does have dynamic branching capabilities, but `file_in()` still needs to receive the file path in the plan as a literal string, not a variable. — landau, Jan 16 '20 at 21:00
Thanks for directing me to dynamic branching. I got both versions to work, but I'm not sure yet which one fits my purposes better. One more question in this context: do I only need tidy evaluation (and !!) when I refer to objects that are "external" to the plan (like the paths in your example above, or 1:4 from the example in your book)? Is that correct? — der_grund, Jan 17 '20 at 13:52
Glad to hear it. Since your pipeline starts with files, I recommend `file_in()` + static branching so each target gets invalidated if the file changes. In `drake`, `file_in()` files are always static. If you use dynamic branching, `file_in("directory_with_data")` will invalidate *all* sub-targets if *any* data file changes, which is not as helpful if you want to save time in subsequent `make()`s. — landau, Jan 17 '20 at 17:12
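
For comparison, the dynamic branching variant discussed in these comments might look roughly like the sketch below. It assumes a drake version with dynamic branching (the `dynamic` argument of `target()`), reuses the "folder" directory from the answer's example, and takes `read.dta13()` from the readstata13 package. It only builds the plan and does not run it.

```r
library(drake)

plan <- drake_plan(
  # file_in() on the directory: as noted in the comment above, a change to
  # any file in it invalidates this target and therefore all sub-targets.
  dta_paths = list.files(file_in("folder"), full.names = TRUE),
  dfs = target(
    # One sub-target per element of dta_paths, created while make() runs.
    readstata13::read.dta13(dta_paths),
    dynamic = map(dta_paths)
  )
)

plan
```

Unlike the static version, the plan keeps two rows however many files there are, because the branching over `dta_paths` only happens during `make()`.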
