Avoiding saving cache for a target in R package drake

Question

I've seen that, by default, the R package drake caches the value of every target. Sometimes a target just selects a few columns from the previous target, but if the data is really big, you end up with two very large saved targets. In addition, if you change a target slightly, I think drake saves a new copy of that target but keeps the previous one. This means that every call to r_make adds to the cache, which ends up taking a lot of memory.

Is there any way of selecting which targets drake saves?
Is there a way of avoiding keeping the history of cache files of a target?

This accumulation is occupying over 45GB on my machine, which seems way off.

Thanks

score 3 · Answer 1 · answered Oct 22 '19 at 13:05

Efficient data formats

First of all, if your targets are data frames (or data.tables), consider saving those targets using the custom "fst" (or "fst_dt") format: https://github.com/ropensci/drake/pull/977. Targets will occupy less storage and save you time. That's a quick win.
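
As a minimal sketch of what that looks like in a plan (get_data() and analyze() are hypothetical placeholders, and running make() on this plan assumes the fst package is installed):

```r
library(drake)

# get_data() and analyze() are hypothetical user-defined functions.
# drake_plan() only records the commands; nothing runs until make().
plan <- drake_plan(
  data = target(
    get_data(),
    format = "fst"  # store this data frame target with fst
  ),
  analysis = analyze(data)
)

plan
```

Only the targets marked with format = "fst" use the custom format; the others are stored the default way.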

What should be a target?

Second, let's talk about the scenario you raised where two targets are not all that different.

library(drake)
library(dplyr)

drake_plan(
  raw = get_raw_data(),
  data = raw %>%
    select(-funds) %>%
    filter(spending < 900),
  analysis = analyze(data)
)
#> # A tibble: 3 x 2
#>   target   command
#>   <chr>    <expr>
#> 1 raw      get_raw_data()
#> 2 data     raw %>% select(-funds) %>% filter(spending < 900)
#> 3 analysis analyze(data)

Created on 2019-10-22 by the reprex package (v0.3.0)

raw and data are basically copies of each other, and they are large. If you only use raw to compute data, we can skip raw and define our own function to go straight to data. The following plan will use less storage.

library(drake)
library(dplyr)

get_data <- function() {
  get_raw_data() %>%
    select(-funds) %>%
    filter(spending < 900)
}

drake_plan(
  data = get_data(),
  analysis = analyze(data)
)
#> # A tibble: 2 x 2
#>   target   command       
#>   <chr>    <expr>        
#> 1 data     get_data()
#> 2 analysis analyze(data)

Created on 2019-10-22 by the reprex package (v0.3.0)

It is excellent practice to define and use functions this way. Not only does it help you be more strategic about the targets you pick, it makes your plan easier to read.

It takes careful thought to work out what is a target and what is a step that goes inside a target. An ideal target is

Large enough to eat up a lot of runtime, and
Small enough that make() tends to skip it, and
Meaningful to your project.

Column selection is usually too fast to justify creating a whole new target.

Managing the cache

drake's cache uses storr in the backend, which does not store duplicated objects. However, it does store old targets on the off chance that you try to recover them with make(recover = TRUE). But if your cache is getting too large, you can remove these historical targets with garbage collection, either with drake_gc() or drake_cache()$gc().
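
A minimal sketch of that cleanup, using a throwaway cache in a temporary directory and a toy target x (both are assumptions for the demo; in a real project you would simply call drake_gc() or drake_cache()$gc() from the project root):

```r
library(drake)

# Throwaway cache so the demo does not touch a real project's .drake/ folder.
cache <- new_cache(path = tempfile())

# Build a target, then change its command so it gets rebuilt.
make(drake_plan(x = rnorm(10)), cache = cache)
make(drake_plan(x = rnorm(20)), cache = cache)

# The superseded value of x may still sit in storage at this point.
# Garbage collection drops data that the cache no longer references.
cache$gc()

# The current value of x survives garbage collection.
length(readd(x, cache = cache))
```

After garbage collection, old values can no longer be brought back with make(recover = TRUE).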

Other tricks to lighten storage are history = FALSE, log_progress = FALSE, and recoverable = FALSE in make(). These options do not reduce overall storage size by much, but they do reduce the number of small files in the cache.
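
For illustration, those three arguments can be passed together like this (the temporary cache and the toy plan are assumptions for the demo):

```r
library(drake)

cache <- new_cache(path = tempfile())  # throwaway cache for the demo
plan <- drake_plan(x = sum(1:10))      # toy plan

make(
  plan,
  cache = cache,
  history = FALSE,       # do not record the history of targets
  log_progress = FALSE,  # do not log build progress for each target
  recoverable = FALSE    # do not keep the records used by make(recover = TRUE)
)

readd(x, cache = cache)
```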

As for memory, there is a chapter in the manual: https://ropenscilabs.github.io/drake-manual/memory.html. To reduce in-session memory consumption, you can choose a custom memory strategy and elect garbage_collection = TRUE in make().
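
A rough sketch of those two settings, again with a temporary cache and a toy plan as stand-ins. "autoclean" is just one of the strategies covered in the memory chapter; which one fits best depends on your workflow:

```r
library(drake)

cache <- new_cache(path = tempfile())  # throwaway cache for the demo
plan <- drake_plan(
  big = rnorm(1e5),     # stand-in for a large target
  summary = mean(big)
)

make(
  plan,
  cache = cache,
  memory_strategy = "autoclean",  # unload targets from the session when no longer needed
  garbage_collection = TRUE       # call gc() as targets are built
)

readd(summary, cache = cache)
```

These settings affect RAM during make(), not the size of the cache on disk.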

I meant to say that "fst" targets will occupy *less* storage, not more. — landau, Oct 22 '19 at 13:10
Perfect, great answer! One last thing: is there no way to specify that certain targets should not be saved? (just for them to be run every time you run `make()` without saving them) — cimentadaj, Oct 22 '19 at 13:20
There is not a specific option, but there is a hack: just return an empty value from a command, e.g. `drake_plan(x = {stuff(); NULL})` or define `stuff()` with a `NULL` return value. — landau, Oct 22 '19 at 14:14
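
A small sketch of the hack from the last comment. stuff() is a hypothetical side-effect function, and the trigger(condition = TRUE) part is an addition not mentioned in the comment, included on the assumption that the target should rerun on every make():

```r
library(drake)

# Hypothetical step that does its work and returns nothing worth caching.
stuff <- function() {
  message("running stuff()")
  invisible(NULL)
}

plan <- drake_plan(
  x = target(
    {
      stuff()
      NULL  # only this empty value is stored in the cache
    },
    trigger = trigger(condition = TRUE)  # assumption: rebuild on every make()
  )
)

cache <- new_cache(path = tempfile())  # throwaway cache for the demo
make(plan, cache = cache)

is.null(readd(x, cache = cache))
```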
