R targets multiple file outputs

Question

I am looking into using R's targets but I am struggling to have it accept multiple file outputs.

For example, I want to be able to take a dataset, create a train/test split and write each dataset to a separate file.

A minimal working example (MWE) would be:

_targets.R

library(targets)
source("R/functions.R")

set.seed(124)

list(
  # created using write.csv(mtcars, "data/mtcars.csv")
  tar_target(raw_data, "data/mtcars.csv", format = "file"),
  tar_target(data, read.csv(raw_data)),
  # this throws an error here:
  tar_target(train_test, split_dataset(data), format = "file"),
  # this only shows how I would try to use the train/test datasets
  tar_target(model, train_model(train_test)),
  tar_target(eval, eval_model(model, train_test))
)

where split_dataset() is defined in R/functions.R

split_dataset <- function(data) {
  idx <- sample.int(nrow(data), 0.8 * nrow(data))
  train <- data[idx, ]
  test <- data[-idx, ]
  write.csv(train, "data/train.csv")
  write.csv(test, "data/test.csv")
  return(c("data/train.csv", "data/test.csv"))
}

One alternative would be to return a list, list(train = train, test = test), but I want to be able to access either dataset on its own, if possible, and to save the datasets as separate files.

Another alternative approach would be to define the index in the targets list, split the dataset and write each dataset in a separate target. If possible I would like to condense the steps into one (as shown above) to make the targets file easier to understand.
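
For reference, that second alternative might look roughly like the sketch below. The target names idx, train_file and test_file and the helper write_subset() are made up for illustration and are not part of targets.

```r
# _targets.R (sketch of the multi-target alternative)
library(targets)

# Hypothetical helper: write the selected rows to `path` and return the path,
# since a format = "file" target is expected to return file paths.
write_subset <- function(data, rows, path) {
  dir.create(dirname(path), showWarnings = FALSE, recursive = TRUE)
  write.csv(data[rows, ], path, row.names = FALSE)
  path
}

list(
  tar_target(raw_data, "data/mtcars.csv", format = "file"),
  tar_target(data, read.csv(raw_data)),
  # The split index lives in its own target ...
  tar_target(idx, sample.int(nrow(data), floor(0.8 * nrow(data)))),
  # ... and each dataset is written by a separate file target.
  tar_target(train_file, write_subset(data, idx, "data/train.csv"), format = "file"),
  tar_target(test_file, write_subset(data, -idx, "data/test.csv"), format = "file")
)
```

This works, but it needs three targets where I would prefer one.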

score 4 · Accepted Answer · answered Mar 30 '21 at 16:15

I recommend appending idx as a column to data and then filtering on it later for the train and test targets. Also, you do not need format = "file" to be able to access datasets later. You can use tar_read() or tar_load() for that. Sketch:

library(targets)
library(tibble)

dir.create("data")
write.csv(mtcars, "data/mtcars.csv")

tar_script({
  library(tibble)
  split_data <- function(data) {
    idx <- sample.int(n = nrow(data), size = 0.8 * nrow(data))
    data$is_training <- seq_len(nrow(data)) %in% idx
    as_tibble(data)
  }
  
  list(
    tar_target(raw_data, "data/mtcars.csv", format = "file"),
    tar_target(data, split_data(read.csv(raw_data)), format = "feather"),
    tar_target(train, data[data$is_training, ], format = "feather"),
    tar_target(test, data[!data$is_training, ], format = "feather")
  )
})

tar_visnetwork()


tar_make()
#> ● run target raw_data
#> ● run target data
#> ● run target test
#> ● run target train
#> ● end pipeline

tar_read(train)
#> # A tibble: 25 x 13
#>    X             mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <chr>       <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#>  1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10 Merc 280C    17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
#> # … with 15 more rows, and 1 more variable: is_training <lgl>

tar_read(test)
#> # A tibble: 7 x 13
#>   X              mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#> 1 Merc 280      19.2     6 168.    123  3.92  3.44  18.3     1     0     4     4
#> 2 Merc 450SLC   15.2     8 276.    180  3.07  3.78  18       0     0     3     3
#> 3 Lincoln Con…  10.4     8 460     215  3     5.42  17.8     0     0     3     4
#> 4 Fiat 128      32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
#> 5 AMC Javelin   15.2     8 304     150  3.15  3.44  17.3     0     0     3     2
#> 6 Fiat X1-9     27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
#> 7 Lotus Europa  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
#> # … with 1 more variable: is_training <lgl>

Created on 2021-03-30 by the reprex package (v1.0.0)
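
If the CSV files themselves are needed on disk, the single-target approach from the question should also be workable: a format = "file" target may return a character vector containing more than one path. A minimal sketch, assuming the data directory is writable. The `dir` argument and the downstream train and test targets are additions for illustration. The paths are selected by position because it is not safe to assume that names on the returned vector survive storage.

```r
# _targets.R (sketch)
library(targets)

# Write both splits to disk and return their paths so targets can track them.
split_dataset <- function(data, dir = "data") {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  idx <- sample.int(nrow(data), floor(0.8 * nrow(data)))
  paths <- c(file.path(dir, "train.csv"), file.path(dir, "test.csv"))
  write.csv(data[idx, ], paths[1], row.names = FALSE)
  write.csv(data[-idx, ], paths[2], row.names = FALSE)
  paths
}

list(
  tar_target(raw_data, "data/mtcars.csv", format = "file"),
  tar_target(data, read.csv(raw_data)),
  # One target, two tracked files.
  tar_target(train_test, split_dataset(data), format = "file"),
  # Downstream targets read the file they need: 1 = train, 2 = test.
  tar_target(train, read.csv(train_test[1])),
  tar_target(test, read.csv(train_test[2]))
)
```

With this layout, changing or deleting either CSV should invalidate train_test and everything downstream of it, at the cost of reading the files back in instead of passing data frames between targets.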
