
I have a workflow that I run against variations of essentially the same dataset (it's an EMR extract; sometimes I run against iterations of the bulk extract, and sometimes against iterations of test extracts).

These datasets are (supposed to be) homogeneous, and in general have the same processing requirements.

That said, before I migrated the project to drake, much of the analysis had been performed on a subset of one of the test datasets, sometimes semi-interactively, with little guarantee of reproducibility.

Although I generally don't wish to filter my datasets on the same criteria the analysts started from, for some datasets it's helpful to do so, in order to verify that the workflow is in fact producing the same results as the original analysis did for the same input.

An example of the starting filter the analysts may have used:

filter_extract_window <- function(df) {
  # Keep only admissions that fall inside the extract window
  start <- lubridate::dmy("01-04-2017")
  end   <- lubridate::dmy("30-06-2017")

  df %>%
    dplyr::filter(admit_dttm > start, admit_dttm < end)
}

A given dataset is stored entirely separately from the project's code, in a directory tree that contains that dataset's drake_cache and a subdirectory of the raw data.
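A hypothetical layout of such a dataset directory (the names here are illustrative, not prescriptive; `.drake/` is drake's default cache directory name):

/path/to/your/data
├── .drake/        # this dataset's drake_cache
└── raw/           # the raw extract files for this dataset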

My question is then: what's a nice way to import such a function into my workflow, without it being a statically declared import?

  • You can write `ignore(filter_extract_window)(df = your_df)` whenever you call the function and `drake` will not track it. – landau Feb 11 '20 at 12:21
  • I'm having a little trouble understanding the motivation though. If you are starting from the subset of the data someone else created, when would you call `filter_extract_window()`? And if you start from the full dataset to do your own processing/subsetting, for what reason would you call `filter_extract_window()` but want to hide it from `drake`? – landau Feb 11 '20 at 12:23
  • 1
    Probably this issue linked from the FAQ may have served me better / Demonstrates similar requiremnts to mine https://github.com/ropensci/drake/issues/706 – Matthew Strasiotto Feb 17 '20 at 04:13

1 Answer


Given some thought, and the time it took to write this question out, I think the following approach will suit this workflow.

Define filter_extract_window, or any equivalent function, within the codebase / a package, as you normally would, i.e.:

within {mypackage}:

filter_extract_window <- function(df) {
  # Keep only admissions that fall inside the extract window
  start <- lubridate::dmy("01-04-2017")
  end   <- lubridate::dmy("30-06-2017")

  df %>%
    dplyr::filter(admit_dttm > start, admit_dttm < end)
}

Place a script "filter.R" in the same directory that you keep your data in. Its last (here, only) expression should evaluate to the filter function:

# The following function is what I want to use to filter this dataset
mypackage::filter_extract_window

In your codebase (e.g., {mypackage}), write a function that evaluates this script:

eval_script <- function(path) {
  # Fall back to a pass-through filter if this dataset has no filter.R
  out <- identity
  if (! file.exists(path) ) return(out)

  # The value of the script is its last evaluated expression --
  # here, the filter function itself.
  out <- parse(file = path) %>%
    eval(envir = new.env())

  out
}
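As a quick sanity check, the same idea can be sketched in base R without the magrittr pipe (the temp script and the toy filter below are made up for illustration):

```r
# Base-R equivalent of eval_script: a script's value is its last expression.
eval_script <- function(path) {
  if (!file.exists(path)) return(identity)  # no script -> pass-through filter
  eval(parse(file = path), envir = new.env())
}

# A stand-in for filter.R whose last expression is a filter function
script <- tempfile(fileext = ".R")
writeLines("function(df) df[df$x > 1, , drop = FALSE]", script)

f <- eval_script(script)
f(data.frame(x = 0:3))            # keeps only the rows where x > 1

eval_script("no/such/filter.R")   # returns identity, a safe no-op
```

Evaluating in `new.env()` keeps anything the script defines out of the caller's environment.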

Now, for your drake::drake_plan, we might see the following:

data_root <- "/path/to/your/data"

plan <- drake::drake_plan(
   filter_fn = drake::target(
      mypackage::eval_script(file.path(!!data_root, "filter.R")),
      # Re-run this target whenever the evaluated function's value changes
      trigger = drake::trigger(
        change = mypackage::eval_script(file.path(!!data_root, "filter.R"))
      )
   )
)

eval_script should return fast enough that using it for the trigger should be fine in this instance.
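For completeness, a sketch of how `filter_fn` might then be consumed downstream in the same plan (the `raw_data` target and the `read_extract()` helper are hypothetical, not part of the answer above):

```r
plan <- drake::drake_plan(
  filter_fn = drake::target(
    mypackage::eval_script(file.path(!!data_root, "filter.R")),
    trigger = drake::trigger(
      change = mypackage::eval_script(file.path(!!data_root, "filter.R"))
    )
  ),
  # Hypothetical downstream targets: read the raw extract, then apply
  # whatever filter (possibly identity) this dataset's directory supplies.
  raw_data      = mypackage::read_extract(file.path(!!data_root, "raw")),
  filtered_data = filter_fn(raw_data)
)
```

Datasets without a filter.R get `identity` as `filter_fn`, so `filtered_data` is simply the unfiltered extract.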