I am creating a pipeline that accepts an arbitrary number of dataset names and puts each dataset through a similar cleaning process. To do this, I am using the targets package: with the tar_map function from tarchetypes, I subject each dataset to a series of tidying and wrangling functions.
My issue now is that one dataset needs to be split into three datasets by a factor (a la split), while the rest should remain untouched. The pipeline would then move on to process each dataset individually, including the three 'daughter' datasets.
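To illustrate the kind of split I mean, here is a minimal sketch on a toy data frame. The names toy and pieces are made up for illustration, and base split() on a data frame only stands in for the real case, where the objects are not plain data frames:

```r
# Toy data with the same columns as the example files (hypothetical, for illustration only)
toy <- data.frame(nums = 1:6, groups = rep(letters[1:3], each = 2))

# split() returns a named list with one data frame per level of `groups`
pieces <- split(toy, toy$groups)

names(pieces)        # "a" "b" "c"
nrow(pieces[["a"]])  # 2
```

In the pipeline, each element of a list like pieces would ideally become its own target and continue through the same downstream steps as the unsplit datasets.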
Here's my best attempt:
library(targets)
library(tarchetypes)
library(tidyverse)
# dir.create("./data")
# tibble(nums = 1:300, groups = rep(letters[1:3], each = 100)) |>
#   write_csv("./data/td1.csv")
# tibble(nums = 301:600, groups = rep(letters[1:3], each = 100)) |>
#   write_csv("./data/td2.csv")
# tibble(nums = 601:900, groups = rep(letters[1:3], each = 100)) |>
#   write_csv("./data/td3.csv")
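tar_option_set(
  packages = c("tidyverse")
)

read_data <- function(paths) {
  read_csv(paths)
}

# The argument is named `group` so that it is not masked by the `groups`
# column inside filter(); `groups == groups` would compare the column with itself.
get_group <- function(data, group) {
  filter(data, groups == group)
}

do_nothing <- function(data) {
  data
}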
list(
  map1 <- tar_map(
    values = tibble(datasets = c("./data/td1.csv", "./data/td2.csv", "./data/td3.csv")),
    tar_target(data, read_data(datasets)),
    map2 <- tar_map(
      values = tibble(groups = c("a", "b", "c")),
      tar_skip(tester, get_group(data, groups), !str_detect(tar_name(), "td3\\.csv$"))
    ),
    tar_target(dn, do_nothing(list(data, tester)))
  )
)
The skipping method is a bit clumsy, and I may be thinking about that wrong as well.
I'm obviously combining the targets poorly at the end there by putting them in a list, but I'm at a loss as to what else to do.
The datasets can't be combined by, say, rbind, since in actuality they are SummarizedExperiment objects.
Any help is appreciated - let me know if any further clarification is needed.