I am confused about how the 'classbalancing' PipeOp in mlr3pipelines behaves when it is used inside 'resample'.
I am fitting a binary logistic regression on a large dataset and would like to downsample the majority class during training (and ideally during testing as well).
When I downsample outside of the 'resample' function, I get the expected result: the majority class frequency equals 5 times the minority class frequency.
Setup
library(mlr3verse)  # attaches mlr3, mlr3pipelines and mlr3learners
task_lr <- TaskClassif$new('task_lr', backend = data.df, target = 'ls')
Without resample
opb <- po('classbalancing', adjust = 'downsample', reference = 'minor', ratio = 5) %>>%
  po("encode", method = "treatment") %>>%
  po("scale") %>>%
  lrn("classif.cv_glmnet", predict_sets = c("test", "train"))
opb$keep_results <- TRUE
opbLearner <- as_learner(opb)
opbResult <- opbLearner$train(task_lr)
opbResult$graph$pipeops$classbalancing$.result$output
The stored output task has dimensions 702 x 16, as desired.
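To check the class frequencies rather than only the dimensions, the PipeOp can also be trained in isolation. Below is a minimal sketch, not my actual setup: it assumes the built-in `tsk("spam")` task as a stand-in for `task_lr`, and uses `ratio = 1` (instead of 5) only so that the downsampling is visible on that smaller dataset.

```r
library(mlr3)
library(mlr3pipelines)

set.seed(1)

# Assumption: built-in spam task used as a stand-in for task_lr
task <- tsk("spam")

# Same PipeOp as above; ratio = 1 is chosen only for this illustration
pob <- po("classbalancing", adjust = "downsample", reference = "minor", ratio = 1)

# Train the PipeOp on its own; the output is a list holding one task
balanced <- pob$train(list(task))[[1]]

# Class frequencies before and after downsampling
counts_before <- table(task$truth())
counts_after <- table(balanced$truth())
print(counts_before)
print(counts_after)
```

The same `table(...$truth())` call should work on the task stored in `.result$output`, since that is also a classification task.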
With resample
resampleTest <- mlr3::resample(task = task_lr, opbLearner, resampling = rsmp("subsampling", ratio = 0.7, repeats = 1))
resampleTest$learners[[1]]$graph$pipeops$classbalancing$.result$output
The stored output task has dimensions 2541608 x 16.
EDIT
With a subsampling ratio of 1, the output stored in the graph has the expected size.
resampleTest <- mlr3::resample(task = task_lr, opbLearner, resampling = rsmp("subsampling", ratio = 1, repeats = 1))
resampleTest$learners[[1]]$graph$pipeops$classbalancing$.result$output
The stored output task has dimensions 702 x 16, as desired.
My question is whether the resampling done by the 'resample' function can be applied to the dataset after the pipe operators have modified it. I also do not understand why 'resample' produces the result I want when the subsampling ratio is 1.
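One check that might narrow this down is to compare what the PipeOp returns in its train phase and in its predict phase. As far as I understand the mlr3pipelines documentation, 'classbalancing' samples only during training, so a stored `.result` could reflect whichever phase ran last. The sketch below is an illustration under assumptions, not my real pipeline: `tsk("spam")` stands in for `task_lr`, and `ratio = 1` is used only to make the effect visible.

```r
library(mlr3)
library(mlr3pipelines)

set.seed(1)

# Assumption: built-in spam task used as a stand-in for task_lr
task <- tsk("spam")

# ratio = 1 is chosen only for this illustration
pob <- po("classbalancing", adjust = "downsample", reference = "minor", ratio = 1)

# Row count of the task returned by the train phase
n_train <- pob$train(list(task))[[1]]$nrow

# Row count of the task returned by the predict phase
n_predict <- pob$predict(list(task))[[1]]$nrow

print(c(original = task$nrow, train = n_train, predict = n_predict))
```

If the predict phase returns the full task, that would be consistent with the large dimensions I see after 'resample', but I have not confirmed that this is what happens inside the GraphLearner.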