
I am creating a pipeline that accepts an arbitrary number of dataset names, all of which go through similar cleaning processes. To do this, I am using the targets package, and with the tar_map() function from tarchetypes, I subject each dataset to a series of tidying and wrangling functions.

My issue now is that one dataset needs to be split into three datasets by a factor (à la split()), while the rest should remain untouched. The pipeline would then move on to process each dataset individually, including the three 'daughter' datasets.

Here's my best attempt:

library(targets)
library(tarchetypes)
library(tidyverse)

# dir.create("./data")
# tibble(nums = 1:300, groups = rep(letters[1:3], each = 100)) |> 
#   write_csv("./data/td1.csv")
# tibble(nums = 301:600, groups = rep(letters[1:3], each = 100)) |> 
#   write_csv("./data/td2.csv")
# tibble(nums = 601:900, groups = rep(letters[1:3], each = 100)) |> 
#   write_csv("./data/td3.csv")

tar_option_set(
  packages = c("tidyverse")
)

read_data <- function(paths) {
  read_csv(paths)
}

get_group <- function(data, group) {
  # Keep only the rows belonging to one factor level
  filter(data, groups == group)
}

do_nothing <- function(data) {
  data
}

list(
  map1 <- tar_map(
    values = tibble(datasets = c("./data/td1.csv", "./data/td2.csv", "./data/td3.csv")),
    tar_target(data, read_data(datasets)),
    map2 <- tar_map(values = tibble(groups = c("a", "b", "c")),
            tar_skip(tester, get_group(data, groups), !str_detect(tar_name(), "td3\\.csv$"))
    ),
    tar_target(dn, do_nothing(list(data, tester)))
  )
)

The skipping method is a bit clumsy; I may be thinking about that wrong as well.

I'm obviously combining the targets poorly at the end there by putting them in a list, but I'm at a loss as to what else to do.

The datasets can't be combined with, say, rbind(), since they are actually SummarizedExperiment objects.

Any help is appreciated - let me know if any further clarification is needed.

Kai Aragaki

1 Answer


If you know the levels of that factor in advance, you can handle the splitting of that third dataset with a separate tar_map() call, similar to what you do now. If you do not know the factor levels in advance, then the splitting needs to be handled with dynamic branching, and I recommend something like tarchetypes::tar_group_by().

I do not think tar_skip() is relevant here, and I recommend removing it.

If you start with physical files (or write physical files), then I strongly suggest you track them with format = "file": https://books.ropensci.org/targets/files.html#external-input-files.

library(targets)
library(tarchetypes)
tar_option_set(packages = "tidyverse")

list(
  # Static branches: one file target and one data target for each
  # dataset that does not need splitting.
  tar_map(
    values = list(paths = c("data/td1.csv", "data/td2.csv")),
    tar_target(file, paths, format = "file"),
    tar_target(data, read_csv(file, col_types = cols()))
  ),
  # The dataset that needs splitting: track the file, then read it
  # and group the rows dynamically by the groups column.
  tar_target(file3, "data/td3.csv", format = "file"),
  tar_group_by(data3, read_csv(file3, col_types = cols()), groups),
  # One dynamic branch per group; each branch sees only its own rows.
  tar_target(
    data3_row_counts,
    tibble(group = data3$groups[1], n = nrow(data3)),
    pattern = map(data3)
  )
)
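
For the case where the factor levels are known in advance, the separate tar_map() route mentioned above could look something like the sketch below. This is only an illustration under that assumption: the levels here are "a", "b", and "c", and get_group() is a hypothetical helper (not part of targets or tarchetypes) that filters on a single known level.

library(targets)
library(tarchetypes)
tar_option_set(packages = "tidyverse")

# Hypothetical helper: keep only the rows of one known factor level.
get_group <- function(data, group) {
  filter(data, groups == group)
}

list(
  tar_target(file3, "data/td3.csv", format = "file"),
  tar_target(data3, read_csv(file3, col_types = cols())),
  # One static branch per known level: data3_split_a, _b, and _c.
  tar_map(
    values = list(group = c("a", "b", "c")),
    tar_target(data3_split, get_group(data3, group))
  )
)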
landau
  • Thank you Dr. Landau. The lab decided to go with a different analysis strategy for the time being, so I have yet to implement your solution - but I imagine I will be returning to it in time. – Kai Aragaki Jul 19 '21 at 19:58