I have a small dataset on which I am training random forest and lasso models. If I run cv.glmnet from glmnet, it performs k-fold CV and finds the optimal lambda within a second. If I set up an AutoTuner from mlr3 and train it on the same data, it takes far longer. I have specified the search space and resolution for mlr3 to match the lambda range and number of values from cv.glmnet (100 values between its minimum and maximum lambda).
library(glmnet)

start_time <- Sys.time()
# 5-fold CV over glmnet's internally generated lambda sequence (lasso, alpha = 1)
cv_model <- cv.glmnet(x, y, nfolds = 5, alpha = 1, family = "binomial",
                      type.measure = "deviance", keep = FALSE)
end_time <- Sys.time()
end_time - start_time
Time difference of 0.8357668 secs
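By "optimal lambda" I just mean the values that cv.glmnet reports directly on the fitted object:

# lambda minimising the CV deviance, and the 1-SE-rule alternative
cv_model$lambda.min
cv_model$lambda.1se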
library(mlr3)
library(mlr3learners)  # provides the classif.glmnet learner
library(mlr3tuning)
library(paradox)

task <- TaskClassif$new(
  id = "test",
  backend = na.omit(test[, ..keep_all]),
  target = "tox"
)

lrn_glmnet <- lrn("classif.glmnet", predict_type = "prob")
measure <- msr("deviance")
resampling <- rsmp("cv", folds = 5)
search_space <- ps(s = p_dbl(lower = min(cv_model$lambda), upper = max(cv_model$lambda)))
terminator <- trm("none")
tuner <- tnr("grid_search", resolution = 100)

at <- AutoTuner$new(lrn_glmnet, resampling, measure, terminator, tuner, search_space)
start_time <- Sys.time()
at$train(task)
end_time <- Sys.time()
end_time - start_time
Time difference of 1.107895 mins
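In case it helps with diagnosing, this is how I'm inspecting what the tuner did (just the standard AutoTuner fields, nothing custom); the archive has one row per evaluated configuration, i.e. 100 rows, each aggregated over the 5 inner folds:

# best s found by the grid search, plus the full evaluation archive
at$tuning_result
as.data.table(at$archive)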
The difference in computation time here is roughly 80-fold. I'd really like to use mlr3, since it makes nested CV easy, lets me benchmark learners against each other, and means I don't have to rely on the internals of another package, but roughly two orders of magnitude is a very high price to pay (e.g. with 100 repetitions of 5 outer folds, I am looking at roughly 500 minutes versus 500 seconds).
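For context, the nested CV I have in mind is just the standard mlr3 pattern of resampling the AutoTuner; a sketch, reusing the at, task and measure objects from above, with the outer repeated CV being what multiplies the cost:

# outer loop: 100 repetitions of 5-fold CV; the AutoTuner re-runs the
# inner 5-fold tuning over the 100 grid points inside every outer training set
outer_resampling <- rsmp("repeated_cv", folds = 5, repeats = 100)
rr <- resample(task, at, outer_resampling, store_models = FALSE)
rr$aggregate(measure)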
Am I doing something wrong? Any suggestions on improving speed here? Thanks!