When I need to understand my drake plan, vis_drake_graph() comes in handy, and it also displays the time that each target took to run. This is very helpful in deciding whether targets should be broken down to reduce re-run time after small changes.

My need is related: Because many of my long-running targets involve the manipulation of large data sets, it is important for me to understand the size that each cached target takes on disk. This would help me understand if targets should be combined to prevent the storage of huge intermediate results (even if it would increase re-run time in case of a change to the combined target).

I have examined both the config object and the intermediate object returned by drake_graph_info(), and have not been able to find this information in either. It would be very useful to have it, and potentially other information (such as the time a target was last run), either shown by specifying parameters to vis_drake_graph() or available by examining the config object manually.

So the question is, is there a way to get this information?

Magnus

2 Answers

drake uses a package called storr to handle the storage of targets. As far as I know, storr does not have an easy way to get file size information. However, at least for the default storr_rds() cache type, maybe it should. You could request it as a feature. If implemented, we would have an easier version of the following workaround, at least in the case of RDS caches.

library(drake)
load_mtcars_example()
make(my_plan, verbose = 0L)
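cache <- get_cache() # or storr::storr_rds(".drake")
# Root directory of the RDS cache on disk.
root <- cache$driver$path
# storr maps each key (target name) to the hash of its stored value.
hash <- cache$driver$get_hash("small", namespace = "objects")
# The value itself is stored as data/<hash>.rds under the cache root.
path <- file.path(root, "data", paste0(hash, ".rds"))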
file.exists(path)
#> [1] TRUE
file.size(path)
#> [1] 404

Created on 2019-05-07 by the reprex package (v0.2.1)
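
As a side note, the difference between what an object occupies in memory and what it occupies as an RDS file can be checked without drake at all. The sketch below uses only base R; rds_size() is a hypothetical helper (not part of drake or storr), and it assumes the cache writes compressed RDS files, which may not match every storr configuration.

```r
# Hypothetical helper (not part of drake or storr): number of bytes an
# object occupies once written to disk as an RDS file.
rds_size <- function(object, compress = TRUE) {
  path <- tempfile(fileext = ".rds")
  on.exit(unlink(path), add = TRUE)   # remove the temporary file afterwards
  saveRDS(object, path, compress = compress)
  file.size(path)
}

on_disk   <- rds_size(mtcars)                   # bytes on disk (compressed)
in_memory <- as.numeric(object.size(mtcars))    # bytes in memory
c(on_disk = on_disk, in_memory = in_memory)
```

The two numbers generally differ, which is why measuring the cache file itself, as in the workaround above, is more informative for disk usage than object.size().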

drake is all about repetition and runtime, and storr is all about data and storage. As we think about new features, I would prefer to keep these separate goals in mind.

landau

Thanks for the answer, @landau. Using this information, I implemented a function that reports the size of a target on disk, which makes it easy to check the sizes of all targets in a plan:

library(tibble)
library(drake)

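# Size in bytes of the cache file that stores `target`,
# or NA if no such file exists (assumes the default RDS cache).
get_target_size <- function(target) {
    cache <- get_cache() # or storr::storr_rds(".drake")
    root  <- cache$driver$path
    hash  <- cache$driver$get_hash(target, namespace = "objects")
    path  <- file.path(root, "data", paste0(hash, ".rds"))
    if (file.exists(path)) {
        file.size(path)
    } else {
        NA_real_ # numeric NA keeps the return type consistent
    }
}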

load_mtcars_example()
make(my_plan, verbose = 0L)
tibble(target = my_plan$target,
       size = vapply(my_plan$target, get_target_size, numeric(1)))

The output is:

# A tibble: 15 x 2
   target                  size
   <chr>                  <dbl>
 1 report                    55
 2 small                    404
 3 large                    463
 4 regression1_small       2241
 ...

The get_target_size() function suffices for my needs, and I understand that it might not make sense to implement this in drake itself unless there were a more general solution that worked for any storage type.

Magnus