Efficient data formats
First of all, if your targets are data frames (or data.tables), consider saving those targets using the custom "fst" (or "fst_dt") format: https://github.com/ropensci/drake/pull/977. Targets will occupy less space and save you time. That's a quick win.
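As a minimal sketch of what that looks like in a plan: the format is requested per target through target(). The get_data() function and its columns below are hypothetical stand-ins for your own data-loading code, and actually running make() on this plan assumes the fst package is installed.

```r
library(drake)

# Hypothetical stand-in for whatever produces your large data frame.
get_data <- function() {
  data.frame(id = seq_len(5), spending = c(100, 950, 400, 1200, 30))
}

plan <- drake_plan(
  # format = "fst" asks drake to store this data frame with the fst package.
  # For a data.table target, the format would be "fst_dt" instead.
  data = target(get_data(), format = "fst"),
  # Targets without an explicit format use drake's default storage.
  spending_summary = summary(data$spending)
)

# make(plan)  # uncomment to build; requires the fst package
```

The format only makes sense for targets that really are data frames, so ordinary targets such as spending_summary are left alone.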
What should be a target?
Second, let's talk about the scenario you raised where two targets are not all that different.
library(drake)
library(dplyr)

drake_plan(
  raw = get_raw_data(),
  # data is derived from raw, so raw must be piped in explicitly
  data = raw %>%
    select(-funds) %>%
    filter(spending < 900),
  analysis = analyze(data)
)
#> # A tibble: 3 x 2
#>   target   command
#>   <chr>    <expr>
#> 1 raw      get_raw_data()
#> 2 data     raw %>% select(-funds) %>% filter(spending < 900)
#> 3 analysis analyze(data)
Created on 2019-10-22 by the reprex package (v0.3.0)
raw and data are basically copies of each other, and they are large. If you only use raw to compute data, we can skip raw and define our own function to go straight to data. The following plan will use less storage.
library(drake)
library(dplyr)

# Wrap the cleaning steps in one function so raw is never stored as a target
get_data <- function() {
  get_raw_data() %>%
    select(-funds) %>%
    filter(spending < 900)
}

drake_plan(
  data = get_data(),
  analysis = analyze(data)
)
#> # A tibble: 2 x 2
#>   target   command
#>   <chr>    <expr>
#> 1 data     get_data()
#> 2 analysis analyze(data)
It is excellent practice to define and use functions this way. Not only does it help you be more strategic about the targets you pick, it makes your plan easier to read.
It takes careful thought to work out what is a target and what is a step that goes inside a target. An ideal target is
- Large enough to eat up a lot of runtime, and
- Small enough that make() tends to skip it, and
- Meaningful to your project.
Column selection is usually too fast to justify creating a whole new target.
Managing the cache
drake's cache uses storr in the backend, which does not store duplicated objects. However, it does store old targets on the off chance that you try to recover them with make(recover = TRUE). But if your cache is getting too large, you can remove these historical targets with garbage collection, either with drake_gc() or drake_cache()$gc().
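Here is a rough sketch of that cleanup. It uses a throwaway cache in a temporary directory so it does not touch a real project; the toy target x and the use of new_cache() are assumptions made only for illustration.

```r
library(drake)

# Throwaway cache in a temporary directory (illustration only).
cache <- new_cache(path = tempfile())

# Build a target, then change its command and build it again.
# The first value of x becomes a historical target in the cache.
make(drake_plan(x = seq_len(10)), cache = cache)
make(drake_plan(x = seq_len(20)), cache = cache)

# Garbage collection drops stored data that is no longer current.
# Calling cache$gc() on the cache object is the other route mentioned above.
drake_gc(cache = cache)

# The current value of x is still available afterwards.
x_now <- readd(x, cache = cache)
```

In a real project you would normally omit the cache argument and let drake_gc() act on the default .drake/ cache. Keep in mind that values removed this way can no longer be recovered with make(recover = TRUE).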
Other tricks to lighten storage are history = FALSE, log_progress = FALSE, and recoverable = FALSE in make(). These last ones do not reduce overall storage size by much, but they do reduce the number of small files in the cache.
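For reference, a call with all three switches turned off might look like the sketch below. The plan and the temporary cache are made up for the example; only the three make() arguments come from the advice above.

```r
library(drake)

# Toy plan and throwaway cache (illustration only).
cache <- new_cache(path = tempfile())
plan <- drake_plan(
  x = seq_len(10),
  y = sum(x)
)

make(
  plan,
  cache = cache,
  history = FALSE,       # do not record the history of past builds
  log_progress = FALSE,  # do not write per-target progress records
  recoverable = FALSE    # do not store what make(recover = TRUE) would need
)

y_value <- readd(y, cache = cache)
```

The trade-off is that you give up build history, progress information, and data recovery for these runs.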
As for memory, there is a chapter in the manual: https://ropenscilabs.github.io/drake-manual/memory.html. To reduce in-session memory consumption, you can choose a custom memory strategy and elect garbage_collection = TRUE in make().
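A small sketch of such a call follows. The choice of "autoclean" is an assumption picked for illustration; the memory chapter of the manual describes the available strategies and when each one is appropriate. The plan and temporary cache are again toy stand-ins.

```r
library(drake)

# Toy plan and throwaway cache (illustration only).
cache <- new_cache(path = tempfile())
plan <- drake_plan(
  big = seq_len(1e5),
  total = sum(as.numeric(big))
)

make(
  plan,
  cache = cache,
  memory_strategy = "autoclean",  # one of several strategies; see the manual
  garbage_collection = TRUE       # run R's garbage collector as targets build
)

total_value <- readd(total, cache = cache)
```

Stricter memory strategies can cost runtime, because drake may need to reload dependencies from the cache more often.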