I'm trying to use bootstrap resampling as my validation strategy in mlr3, and have been tracking down the cause of this error:

Error in as_data_backend.data.frame(backend, primary_key = row_ids) : Assertion on 'primary_key' failed: Contains duplicated values, position 2.

The reported position varies (it is likely the first repeated row). Based on the error message, I first thought the row names were the problem, so I set them as the col_type$name, and also tried removing the row names from the data before creating the task. Neither helped.
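To check the "first repeated row" guess, here is a minimal base R sketch of what a bootstrap draw does to row ids. It does not call mlr3 and is not its internal code. The seed is arbitrary, and n = 208 is the number of rows in the sonar data.

```r
# Base R sketch (no mlr3): draw one bootstrap sample of row ids.
set.seed(1)   # arbitrary seed, only for reproducibility
n <- 208      # number of rows in the sonar data

# Sampling with replacement, same size as the data (like ratio = 1)
row_ids <- sample(n, size = n, replace = TRUE)

# Position of the first repeated id (0 would mean no duplicates)
first_dup <- anyDuplicated(row_ids)

# Number of distinct rows that ended up in the sample
n_unique <- length(unique(row_ids))

print(first_dup)
print(n_unique)
```

Because ids are drawn with replacement, some repeat, so anything that uses them as a unique primary key would presumably fail at the first duplicate. That would explain why the reported position is not fixed.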

While building a reprex, I narrowed the cause down to transforming pipe operators such as 'scale' and 'pca':

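library("mlr3verse")

task <- tsk("sonar")

# Preprocessing step followed by the learner; the error appears
# when a transforming PipeOp such as 'scale' or 'pca' is in the graph.
pipe <- po("scale") %>>%
  po(lrn("classif.rpart"))

# Search space: tune rpart's complexity parameter
ps <- ParamSet$new(list(
  ParamDbl$new("classif.rpart.cp", lower = 0, upper = 0.05)
))

glrn <- GraphLearner$new(pipe)
glrn$predict_type <- "prob"

# Bootstrap resampling: rows are drawn with replacement
bootstrap <- rsmp("bootstrap", ratio = 1, repeats = 5)

instance <- TuningInstanceSingleCrit$new(
  task = task,
  learner = glrn,
  resampling = bootstrap,
  measure = msr("classif.auc"),
  search_space = ps,
  terminator = trm("evals", n_evals = 100)
)

tuner <- tnr("random_search")
tuner$optimize(instance)  # fails with the error above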

I've also tried grid search instead of random search, different learners, and passing the flag "duplicated_ids = TRUE" to rsmp, all with no luck. Switching the resampling from bootstrap to cross-validation, however, does fix the problem.

For reference, in the full pipe/graph I am trying out different feature filters and learners to identify candidate pipelines.

StupidWolf