To bring onlookers up to speed, I will try to spell out the problem. @zipzapboing, please correct me if my description is off-target.
Let's say you have a script that generates a drake plan and executes it.
library(drake)
simulate_data <- function(seed){
  set.seed(seed)
  rnorm(100)
}
seed_grid <- data.frame(
  id = paste0("target_", 1:3),
  seed = sample.int(1e6, 3)
)
print(seed_grid)
#>         id   seed
#> 1 target_1 581687
#> 2 target_2 700363
#> 3 target_3 914982
plan <- map_plan(seed_grid, simulate_data)
print(plan)
#> # A tibble: 3 x 2
#>   target   command
#>   <chr>    <chr>
#> 1 target_1 simulate_data(seed = 581687L)
#> 2 target_2 simulate_data(seed = 700363L)
#> 3 target_3 simulate_data(seed = 914982L)
make(plan)
#> target target_1
#> target target_2
#> target target_3
make(plan)
#> All targets are already up to date.
Created on 2018-11-12 by the reprex package (v0.2.1)
The second make() worked just fine, right? But if you were to run the same script in a different session, you would end up with a different plan. The randomly-generated seed arguments to simulate_data() would be different, so all your targets would build from scratch.
library(drake)
simulate_data <- function(seed){
  set.seed(seed)
  rnorm(100)
}
seed_grid <- data.frame(
  id = paste0("target_", 1:3),
  seed = sample.int(1e6, 3)
)
print(seed_grid)
#>         id   seed
#> 1 target_1 654304
#> 2 target_2 252208
#> 3 target_3 781158
plan <- map_plan(seed_grid, simulate_data)
print(plan)
#> # A tibble: 3 x 2
#>   target   command
#>   <chr>    <chr>
#> 1 target_1 simulate_data(seed = 654304L)
#> 2 target_2 simulate_data(seed = 252208L)
#> 3 target_3 simulate_data(seed = 781158L)
make(plan)
#> target target_1
#> target target_2
#> target target_3
Created on 2018-11-12 by the reprex package (v0.2.1)
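The difference between the two sessions comes entirely from sample.int() drawing on whatever state the session's random number generator happens to be in. Here is a minimal base R sketch of pinning that state so the grid, and therefore the plan, comes out the same every time. The helper make_seed_grid() and the value 2018 are my own assumptions for illustration, not part of drake.

```r
# Hypothetical helper (not part of drake): build the seed grid from a
# fixed starting point so that every session produces the same grid.
make_seed_grid <- function(grid_seed, n = 3) {
  set.seed(grid_seed) # Side effect: resets the global RNG state.
  data.frame(
    id = paste0("target_", seq_len(n)),
    seed = sample.int(1e6, n),
    stringsAsFactors = FALSE
  )
}

grid_a <- make_seed_grid(2018) # 2018 is an arbitrary choice.
tmp <- rnorm(1) # Disturb the global RNG state, as a fresh session would.
grid_b <- make_seed_grid(2018)
identical(grid_a, grid_b)
#> [1] TRUE
```

Passing a grid like this to map_plan() should yield the same commands from run to run, as long as the script and the R version's sampling algorithm stay the same.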
One solution is to be extra careful to hold onto the same plan. However, there is an even easier way: just let drake set the seeds for you. drake automatically gives each target its own reproducible random seed. These target-level seeds are deterministically generated by a root seed (the seed argument to make()) and the names of the targets.
library(digest)
library(drake)
library(magrittr) # defines %>%
simulate_data <- function(){
  mean(rnorm(100))
}
plan <- drake_plan(target = simulate_data()) %>%
  expand_plan(values = 1:3)
print(plan)
#> # A tibble: 3 x 2
#>   target   command
#>   <chr>    <chr>
#> 1 target_1 simulate_data()
#> 2 target_2 simulate_data()
#> 3 target_3 simulate_data()
tmp <- rnorm(1)
digest(.Random.seed) # Fingerprint of the current seed.
#> [1] "0bbddc33a4afe7cd1c1742223764661c"
make(plan)
#> target target_1
#> target target_2
#> target target_3
make(plan)
#> All targets are already up to date.
# The targets have different seeds and different values.
readd(target_1)
#> [1] -0.05530201
readd(target_2)
#> [1] 0.03698055
readd(target_3)
#> [1] 0.05990671
clean() # Destroy the targets.
tmp <- rnorm(1) # Change the global seed.
digest(.Random.seed) # The seed changed.
#> [1] "5993aa5cff4b72a0e14fa58dc5c5e3bf"
make(plan)
#> target target_1
#> target target_2
#> target target_3
# The targets were regenerated with the same values (same seeds).
readd(target_1)
#> [1] -0.05530201
readd(target_2)
#> [1] 0.03698055
readd(target_3)
#> [1] 0.05990671
# You can recover a target's seed from its metadata.
seed <- diagnose(target_1)$seed
print(seed)
#> [1] 1875584181
# And you can use that seed to reproduce
# the target's value outside make().
set.seed(seed)
mean(rnorm(100))
#> [1] -0.05530201
Created on 2018-11-12 by the reprex package (v0.2.1)
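To make the idea concrete, here is a rough base R sketch of how a per-target seed can depend only on a root seed and the target's name, so that the global RNG state stops mattering. To be clear, this is not drake's actual algorithm: derive_seed() and simulate_with_seed() are hypothetical helpers, and the toy hash is just something I picked for illustration.

```r
# Hypothetical helper (NOT drake's internals): hash the root seed and
# the target name into a single integer seed with a toy polynomial hash.
derive_seed <- function(root_seed, target_name) {
  codes <- utf8ToInt(paste(root_seed, target_name))
  hash <- 0
  for (code in codes) {
    # Stays below 2^53, so the double arithmetic is exact.
    hash <- (hash * 31 + code) %% 2147483647
  }
  as.integer(hash)
}

# Hypothetical stand-in for building a target under a given seed.
simulate_with_seed <- function(seed) {
  set.seed(seed)
  mean(rnorm(100))
}

targets <- paste0("target_", 1:3)
seeds <- vapply(targets, derive_seed, integer(1), root_seed = 0)

first_run <- vapply(seeds, simulate_with_seed, numeric(1))
tmp <- rnorm(1) # Change the global seed between runs.
second_run <- vapply(seeds, simulate_with_seed, numeric(1))

identical(first_run, second_run) # Same seeds, same values.
#> [1] TRUE
```

Each target gets a different seed because its name differs, and the values repeat across runs because nothing in the derivation reads the session's RNG state.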
I really should write more in the manual about how seeds work in drake and highlight the original pitfall raised in this thread. I doubt you are the only one who struggled with this issue.