
I am confused about how the 'classbalancing' PipeOp from mlr3pipelines behaves when it is used inside 'resample'.

I am fitting a binary logistic regression on a large dataset, and would like to downsample the majority class during training (and ideally during testing as well).

When I downsample outside of the 'resample' function, I get the expected result (the majority class has 5 times as many rows as the minority class).

Setup

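library(mlr3); library(mlr3pipelines); library(mlr3learners)  # mlr3learners provides classif.cv_glmnet
task_lr <- TaskClassif$new('task_lr', backend = data.df, target = 'ls')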

Without resample

opb <- po("classbalancing", adjust = "downsample", reference = "minor", ratio = 5) %>>%
  po("encode", method = "treatment") %>>%
  po("scale") %>>%
  lrn("classif.cv_glmnet", predict_sets = c("test", "train"))
opb$keep_results <- TRUE
opbLearner <- as_learner(opb)
opbResult <- opbLearner$train(task_lr)
opbResult$graph$pipeops$classbalancing$.result$output

The classbalancing output has dimensions 702 x 16, as desired.

With resample

resampleTest <- mlr3::resample(task = task_lr, opbLearner, resampling = rsmp("subsampling", ratio = 0.7, repeats = 1))
resampleTest$learners[[1]]$graph$pipeops$classbalancing$.result$output

Here the classbalancing output has dimensions 2541608 x 16, far more rows than expected.

EDIT

With a subsampling ratio of 1, the classbalancing output stored in the graph has the expected size.

resampleTest <- mlr3::resample(task = task_lr, opbLearner, resampling = rsmp("subsampling", ratio = 1, repeats = 1))
resampleTest$learners[[1]]$graph$pipeops$classbalancing$.result$output

The output has dimensions 702 x 16, as desired.

My question is whether the train/test split made by the 'resample' function can be applied to the dataset after the pipe operators have modified it. I also do not understand why 'resample' behaves as I want only when the subsampling ratio is 1.
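One cross-check that does not rely on the stored `.result` is to apply the balancing step directly to the training rows of a subsampling split and tabulate the class counts before and after. The sketch below is only an illustration: `toy`, `task_toy`, `pob`, `rs`, `train_task` and `balanced` are hypothetical names, and the synthetic data frame stands in for `data.df`, which is not shown here.

```r
library(mlr3)
library(mlr3pipelines)

set.seed(1)

# Hypothetical imbalanced data standing in for data.df: 1900 "a" vs 100 "b"
toy <- data.frame(
  x1 = rnorm(2000),
  x2 = rnorm(2000),
  ls = factor(rep(c("a", "b"), times = c(1900, 100)))
)
task_toy <- TaskClassif$new("task_toy", backend = toy, target = "ls")

# Same balancing settings as in the question
pob <- po("classbalancing", adjust = "downsample", reference = "minor", ratio = 5)

# Instantiate the split explicitly so the training rows can be inspected
rs <- rsmp("subsampling", ratio = 0.7, repeats = 1)
rs$instantiate(task_toy)

# Restrict a copy of the task to the training rows of iteration 1
train_task <- task_toy$clone()$filter(rs$train_set(1))
print(table(train_task$truth()))  # class counts before balancing

# Apply only the balancing step to that training split
balanced <- pob$train(list(train_task))[[1]]
print(table(balanced$truth()))    # class counts after balancing
```

If the counts after balancing come out at 5:1 here, the balancing itself works on a 70% split, and the discrepancy would more likely lie in what `.result` holds after `resample` than in the PipeOp. That is an assumption on my part, not something I have confirmed.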
