
Hi, I am new to the drake R package and would like to hear some opinions on best practices for using subtasks to manage a large project. A simplified version of my project has two parts: 1) data cleaning and 2) modeling. They are sequential: I do the data cleaning first, and I rarely go back to it once I start the modeling part.

I think the approach suggested by the manual is:

library(drake)
source("functions_1.R") # for plan1
plan1 <- drake_plan(
    # many middle steps to create for_analysis
    foo = some_function(),
    foo_1 = fn_1(foo),
    foo_2 = fn_2(foo_1),
    for_analysis = data_cleaning_fn()
)
plan2 <- drake_plan(
    # I would like to use the target name foo_1 again, but not for the same object that was defined in plan1.
    # What I want:
    # foo_1 = fn_new_1(for_analysis) # this is different from the foo_1 defined above
    # result = model_fn(foo_1)

    # What I actually did
    foo_new_1 = fn_new_1(for_analysis), # I have to define a new name different from foo_1
    result = model_fn(foo_new_1)
)
fullplan <- bind_plans(plan1, plan2)
make(fullplan)

One problem I had in the above workflow is that I have a lot of intermediate targets defined for plan1, but they are useless in plan2.

  1. Is there a way to have a "clean namespace" in plan2, so that I can get rid of the useless names foo_1, foo_2, etc. and reuse them in plan2? The only target I want to keep in plan2 is for_analysis.
  2. Is there a way to use the functions defined in functions_1.R only for plan1 and the functions defined in functions_2.R only for plan2? I would like to work with a smaller set of functions each time.

Thank you a lot!

Kallas
  • Your sketch looks good, it seems like you are using `drake` correctly. But for (1), (2), and below, I am having trouble understanding what you mean. It would help if you could elaborate and maybe sketch pseudo-code of the problem. – landau Jun 23 '20 at 00:51
  • @landau thank you a lot for your comment! I basically would like to drop all the intermediate target names after building `for_analysis` so that I can re-use the target names in `plan2`. I have made some edits in the post to make it more clear. – Kallas Jun 23 '20 at 03:14

1 Answer


Interesting question. drake does not support multiple namespaces in plans. All target names must be unique and all function names must be unique, so if you want to reuse names, you would need to put those plans in separate projects altogether.

You may be running into a situation where you are defining too many targets. Speaking broadly, targets should either (1) produce meaningful output for your project, or (2) eat up enough runtime so that skipping them saves you time. I recommend reading https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets. To condense multiple targets into one, I recommend composing functions together. Example:

foo_all <- function() {
  # Each middle step is super quick, but all put together, they take up noticeable runtime.
  foo <- some_function()
  foo_1 <- fn_1(foo)
  foo_2 <- fn_2(foo_1)
  for_analysis <- data_cleaning_fn()
  for_analysis # return the cleaned data
}

plan1 <- drake_plan(
  for_analysis = foo_all()
)

Also, drake's branching mechanisms are a convenient way to automatically generate names or avoid having to think about names too hard. Maybe have a look at https://books.ropensci.org/drake/static.html and https://books.ropensci.org/drake/dynamic.html.
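
As a rough illustration of static branching, here is a minimal sketch. The functions `fn_1()` and `fn_2()` are toy stand-ins for the steps in the question, and the values mapped over are arbitrary assumptions. The point is only that drake generates the target names, so they do not have to be invented by hand:

```r
library(drake)

# Hypothetical stand-ins for the question's intermediate steps.
fn_1 <- function(x) x + 1
fn_2 <- function(x) x * 2

plan <- drake_plan(
  # One target per value of `start`; drake derives the target names.
  foo_1 = target(
    fn_1(start),
    transform = map(start = c(1, 2))
  ),
  # One downstream target per foo_1 target created above.
  foo_2 = target(
    fn_2(foo_1),
    transform = map(foo_1)
  )
)

# Inspect the generated target names and commands.
print(plan)
```

Printing the plan before calling `make()` is a cheap way to check which names were generated.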

landau
  • Thanks @landau! Making an upper-level function `foo_all()` seems to fit my needs best. The branching feature also looks promising. – Kallas Jun 23 '20 at 14:01
  • Just for clarification, by "put those plans in separate projects", do you mean that: 1) I literally have different R projects (as they are defined in RStudio); or just that: 2) I have to make two different `plan_X.R` files and run them in different R sessions? If it's 2), I think this is also an attractive solution for me, because I can simply export the result from plan1. It would be more problematic if what you mean is 1). – Kallas Jun 23 '20 at 14:09
  • I meant (2), which also requires custom `drake` caches in different directories. (`make()` has a `cache` argument that accepts the output of `drake_cache()` or a [`storr`](https://github.com/richfitz/storr) object.) But that seems extreme for your situation. – landau Jun 23 '20 at 15:09
  • Got you, this is very helpful! I will take a deeper look. Thank you @landau! – Kallas Jun 23 '20 at 16:24
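
The separate-cache setup discussed in the comments above might look roughly like the sketch below. It is condensed into a single R session for brevity, whereas the comments suggest separate sessions. The cache paths and the bodies of `data_cleaning_fn()`, `fn_new_1()`, and `model_fn()` are assumptions made up for illustration. Because each plan has its own cache, the name `foo_1` can be reused in the second plan:

```r
library(drake)

# Hypothetical stand-ins for the real cleaning and modeling functions.
data_cleaning_fn <- function() data.frame(x = 1:10, y = 2 * (1:10))
fn_new_1 <- function(data) data[data$x > 2, ]
model_fn <- function(data) lm(y ~ x, data = data)

# One cache directory per plan (the paths are assumptions).
cache1 <- new_cache(path = file.path(tempdir(), "cache_cleaning"))
cache2 <- new_cache(path = file.path(tempdir(), "cache_modeling"))

# Stage 1: data cleaning, stored in its own cache.
plan1 <- drake_plan(
  for_analysis = data_cleaning_fn()
)
make(plan1, cache = cache1)

# Hand over the single result that the modeling stage needs.
for_analysis <- readd(for_analysis, cache = cache1)

# Stage 2: modeling, with target names independent of stage 1.
plan2 <- drake_plan(
  foo_1 = fn_new_1(for_analysis), # foo_1 is free to reuse here
  result = model_fn(foo_1)
)
make(plan2, cache = cache2)
```

In this sketch, `for_analysis` enters the second plan as an ordinary object in the calling environment rather than as a target, so the modeling stage does not know about the intermediate targets of the cleaning stage.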