
I'm trying to optimize a highly parallel and memory-intensive targets pipeline. I'm noticing that the wall clock time for downstream dynamic branch targets is much longer than the reported execution time for the same target. Example:

● built branch PSUT_Re_all_Chop_all_Ds_all_Gr_all_f29c72e5 [11.05 seconds]

Wall clock time: 20.07 seconds.

To optimize, I would like to reduce the discrepancy between wall clock time and execution time, if possible. But what could be causing this discrepancy?

Background:

  • The input data for each branch target (e.g., _f29c72e5) is created dynamically from rows of a (much) larger upstream data frame target.
  • I set storage = "worker" and retrieval = "worker", as suggested for highly parallel pipelines at https://books.ropensci.org/targets/performance.html.
  • I set memory = "transient" and garbage_collection = TRUE as suggested for high-memory pipelines at https://books.ropensci.org/targets/performance.html.
  • The entire upstream (input) data frame takes about 8 seconds to read from disk with tar_read() in an interactive session, which accounts for nearly the full discrepancy between wall clock time and execution time.

Thus, my working theory is that each dynamically created downstream branch loads the entire upstream target, slices it, and then passes the relevant slice to that branch's function.
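For context, here is a minimal sketch of the kind of pipeline described above (target names and functions such as read_big_df() and analyze_rows() are hypothetical placeholders, not from the actual project):

library(targets)
tar_option_set(
  storage = "worker",
  retrieval = "worker",
  memory = "transient",
  garbage_collection = TRUE
)
list(
  # Large upstream data frame target (hypothetical reader function).
  tar_target(big_df, read_big_df()),
  # One dynamic branch per slice of rows of big_df (hypothetical analysis function).
  tar_target(branch_result, analyze_rows(big_df), pattern = map(big_df))
)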

Is that theory plausible? If so, I will create an example project and post another question asking how to solve this problem.

Thanks in advance for insights.

1 Answer


There are a couple of things you could try. One is to profile the pipeline and look at the flame graph to see what is slowing things down.

proffer::pprof(targets::tar_make(callr_function = NULL, reporter = "silent"))

You may want to run this in the terminal instead of RStudio because the latter sometimes has a strange interaction with proffer and targets together.

If reading that dataset really is the bottleneck, you could set up your pipeline like this:

library(targets)
# Transient memory + garbage collection keep overall memory use low for most targets.
tar_option_set(memory = "transient", garbage_collection = TRUE)
list(
  # Keep the big dataset in memory on the main process while branching over it.
  tar_target(big_data, get_big_data(), memory = "persistent", deployment = "main"),
  # Split big_data into small per-branch slices, also on the main process.
  tar_target(data_slice, big_data, pattern = map(big_data), deployment = "main"),
  # Downstream analysis branches load only their own small slice.
  tar_target(model, analyze_slice(data_slice), pattern = map(data_slice))
)

The first time you run the pipeline, you could call tar_make(data_slice) to build all the slices locally while keeping the big dataset in memory. (If you are using crew, I recommend commenting out the controller at this step.) Then, once data_slice is all up to date, you could run a second tar_make() (or e.g. tar_make_clustermq()) to run the rest of the targets. At this second tar_make(), big_data and data_slice are already up to date, so the full dataset should not need to load at all.
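As a minimal sketch, that two-step run could look like this (using the target names from the example above):

# Step 1: build all the slices locally, keeping big_data in memory on the
# main process. (If you use crew, comment out the controller for this step.)
targets::tar_make(data_slice)

# Step 2: big_data and data_slice are now up to date, so the remaining
# targets run without reloading the full dataset.
targets::tar_make()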

Alternatively, you could try setting memory = "persistent" for just that upstream data target, only while you are building the branches.

  • NB garbage collection might create overhead if there are a lot of branches and not much parallelism. – landau Jun 04 '23 at 11:07