In R targets, cannot read target object of class "dataset"

Question

I am struggling with the interoperability of the R packages torch and targets. For example, if I define a target of class dataset (from torch), I cannot read it with tar_read (from targets), and I cannot use it in other targets.

Here is my dataset generator nn_dataset:

library(torch)
library(targets)
library(dplyr)
library(tidymodels)

nn_dataset <- 
  dataset(
    name = "nn_dataset",
    
    initialize = function(df) {
      data <- self$prepare_data(df)
      
      self$tele <- data$x$tele
      self$class <- data$x$class
      self$y <- data$y
    },
    
    .getitem = function(i) {
      list(
        x = list(
          tele = self$tele[i, ], 
          class = self$class[i, ]
        ), 
        y = self$y[i, ]
      )
    },
    
    .length = function() {
      self$y$size()[[1]]
    },
    
    prepare_data = function(df) {
      target_col <- 
        df$claim_ind_cov_1_2_3_4_5_6 %>% 
        as.integer() %>%
        `-`(1) %>%
        as.matrix()
      
      tele_cols <- 
        df %>%
        select(starts_with(c("h_", "p_", "vmo", "vma"))) %>%
        as.matrix()
    
      class_df <- select(df, expo:years_licensed, distance)
      
      rec_class <-
        recipe(~ ., data = class_df) %>%
        step_impute_median(commute_distance, years_claim_free) %>%
        step_other(all_nominal(), threshold = 0.05) %>%
        step_dummy(all_nominal()) %>%
        prep()

      class_cols <- juice(rec_class) %>% as.matrix()
      
      list(
        x = list(
          tele = torch_tensor(tele_cols),
          class = torch_tensor(class_cols)
        ),
        y = torch_tensor(target_col)
      )
    }
)

If I define the following target:

tar_target(
  name = target_name,
  command = nn_dataset(valid_df)
)

where valid_df is a tibble, and if I then try to read it:

tar_read(target_name)

then I get this error:

Error in cpp_tensor_dim(self$ptr) : external pointer is not valid

I have also tried this:

tar_target(
  name = target_name,
  command = nn_dataset(valid_df),
  format = "torch"  
)

and this:

tar_torch(
  name = target_name,
  command = nn_dataset(valid_df)
)

but neither worked.

`format = "torch"` or `format = tar_format(...)` is definitely recommended for torch objects because of those external pointers. `targets` tries to save and load objects, and torch tensors are not exportable: https://future.futureverse.org/articles/future-4-non-exportable-objects.html. Not sure why it's erroring out when you include `format = "torch"`, and I cannot run your code because it has datasets that I do not have. Please have a look at https://books.ropensci.org/targets/help.html, especially the section on reprexes. — landau, Mar 13 '23 at 18:13
Hello @landau. Thank you very much for your help and for your time! I managed to reproduce the bug with a reprex that you can find at this URL: https://github.com/francisduval/reprex_targets_torch_bug To restore the library from the lockfile, please run renv::restore() (I'm sure you know that :)) — Francis Duval, Mar 14 '23 at 16:15

score 2 · Accepted Answer · answered Mar 14 '23 at 19:15

The format = "torch" capability of targets relies on torch::torch_save() and torch::torch_load(), and these torch functions do not work on the custom R6 classes that come out of MyDataset(mtcars) in your reprex. On top of that, torch data is "non-exportable": as discussed at https://books.ropensci.org/targets/targets.html#saving and https://cran.r-project.org/web/packages/future/vignettes/future-4-non-exportable-objects.html, it cannot simply be saved to disk with something like saveRDS() (which is the default in targets). I do not know torch well enough to recommend something specific, but a solution would require figuring out the R code that safely saves and loads one of these objects, then creating your own custom storage format with tar_format(). The examples at https://docs.ropensci.org/targets/reference/tar_format.html#ref-examples include one for Keras models.
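
To see the "non-exportable" point in isolation, here is a minimal sketch that assumes only that torch is installed. The tensor and the temporary file names are made up for illustration and are not taken from the reprex. It round-trips one tensor through saveRDS() and then through torch_save(); the first attempt is wrapped in tryCatch() because using the restored object is expected to fail with an invalid-pointer error like the one in the question.

```r
library(torch)

x <- torch_tensor(c(1, 2, 3))

# saveRDS() serializes the R-side wrapper, not the memory behind the
# external pointer, so the restored object no longer points at real data.
rds_path <- tempfile(fileext = ".rds")
saveRDS(x, rds_path)
x_rds <- readRDS(rds_path)

# Expected (not guaranteed) to yield an error message instead of values.
rds_result <- tryCatch(
  as.numeric(as_array(x_rds)),
  error = function(e) conditionMessage(e)
)

# torch_save()/torch_load() write and read the tensor data itself.
pt_path <- tempfile(fileext = ".pt")
torch_save(x, pt_path)
x_pt <- torch_load(pt_path)
pt_result <- as.numeric(as_array(x_pt))
```

Comparing rds_result with pt_result shows why the default RDS storage in targets is a poor fit for anything that holds tensors.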

A better alternative would actually be to avoid saving R6 objects altogether, because those are really pieces of code that do not hash well. If you can restructure the pipeline to save simpler versions of the data and only re-create those R6 classes on an as-needed basis, that would be much better, especially if those R6 classes take no time at all to create from e.g. a data frame. So your first target could be the mtcars data frame, and then the model-fitting target could call MyDataset(mtcars), fit the model, and return easy-to-save output generated from that fitted model.
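
A minimal sketch of that restructuring, written as a _targets.R file, might look like the following. The names make_dataset, summarize_dataset, raw_df, and fit_summary are hypothetical, and the summary step is only a stand-in for real model fitting. The point it illustrates is that the dataset object lives and dies inside one target, while only plain R objects are stored.

```r
# _targets.R (sketch under the assumptions stated above)
library(targets)

tar_option_set(packages = "torch")

# Hypothetical generator: builds the R6 dataset from a plain data frame.
make_dataset <- function(df) {
  generator <- torch::dataset(
    name = "my_dataset",
    initialize = function(df) {
      self$x <- torch::torch_tensor(as.matrix(df[, -1]))
      self$y <- torch::torch_tensor(as.matrix(df[, 1]))
    },
    .getitem = function(i) list(x = self$x[i, ], y = self$y[i, ]),
    .length = function() self$y$size()[[1]]
  )
  generator(df)
}

# Stand-in for model fitting: the dataset is created inside the function
# and only an ordinary data frame is returned to targets.
summarize_dataset <- function(df) {
  ds <- make_dataset(df)
  data.frame(n = ds$.length(), mean_y = ds$y$mean()$item())
}

list(
  tar_target(raw_df, mtcars),                         # plain data frame, stored as RDS
  tar_target(fit_summary, summarize_dataset(raw_df))  # plain output, stored as RDS
)
```

With this layout, tar_read(fit_summary) returns an ordinary data frame, so no torch pointer ever needs to be written to the data store.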

Thank you for your answer. It really addresses my issue. Moreover, it helped me understand the targets package a little better. Cheers! — Francis Duval, Mar 14 '23 at 20:07
My plan is to avoid saving R6 objects and re-create them on an as-needed basis instead, as recommended. I found that the way to save a trained neural network, i.e. an object of class luz_module_fitted, is with the luz::luz_save function. I will therefore create my own custom format based on that with targets::tar_format. The trained neural network is really the most important thing to save for me since it takes time to run. — Francis Duval, Mar 14 '23 at 20:33
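
The plan in the comment above could be sketched roughly as follows. This assumes that luz::luz_save() and luz::luz_load() round-trip the fitted object as intended; fit_model and train_df are hypothetical placeholders, and the marshal/unmarshal arguments of tar_format() (relevant for parallel workers) are left out for brevity.

```r
library(targets)

# Custom storage format: targets calls these two functions instead of
# saveRDS()/readRDS() for any target declared with this format.
luz_format <- tar_format(
  read = function(path) luz::luz_load(path),
  write = function(object, path) luz::luz_save(object, path)
)

list(
  tar_target(
    name = fitted_model,
    command = fit_model(train_df),  # hypothetical function returning a luz_module_fitted
    format = luz_format
  )
)
```

The functions passed to tar_format() refer to luz with the luz:: prefix rather than relying on objects from the calling environment, which keeps them self-contained.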
