
I have a small dataset on which I am training random forest and lasso models. If I run cv.glmnet from glmnet, it performs k-fold CV and finds the optimal lambda within a second. If I set up an AutoTuner from mlr3 to do the same thing, it spends far longer computing, even though I have specified the search space and resolution for mlr3 to match those of cv.glmnet.

library(glmnet)

# 5-fold CV over glmnet's default lambda path
start_time <- Sys.time()
cv_model <- cv.glmnet(x, y, nfolds = 5, alpha = 1, family = "binomial",
                      type.measure = "deviance", keep = FALSE)
end_time <- Sys.time()

end_time - start_time

Time difference of 0.8357668 secs

library(mlr3)
library(mlr3learners)
library(mlr3tuning)
library(paradox)
library(data.table)

task = TaskClassif$new(
  id = "test",
  backend = na.omit(test[, ..keep_all]),
  target = "tox"
)

lrn_glmnet = lrn("classif.glmnet", predict_type = "prob")
# "deviance" is not a registered mlr3 measure key; log loss is the
# closest built-in equivalent to binomial deviance
measure = msr("classif.logloss")
resampling = rsmp("cv", folds = 5)
# search lambda over the same range that cv.glmnet covered
search_space = ps(s = p_dbl(lower = min(cv_model$lambda), upper = max(cv_model$lambda)))
terminator = trm("none")
tuner = tnr("grid_search", resolution = 100)

at = AutoTuner$new(lrn_glmnet, resampling, measure, terminator, tuner, search_space)

start_time <- Sys.time()
at$train(task)
end_time <- Sys.time()

end_time - start_time

Time difference of 1.107895 mins

The difference here is roughly 80-fold in computation time. I'd really like to use mlr3, since it makes nested CV straightforward, lets me benchmark learners together, and doesn't make me rely on the internal workings of another package, but two orders of magnitude is a very high price to pay (e.g., for 100 iterations of 5 outer folds, I am looking at roughly 500 minutes vs. 500 seconds).
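
For concreteness, here is the nested-CV setup I have in mind, reusing `task` and `at` from above; this is a sketch, and `extract_inner_tuning_results()` is from mlr3tuning:

# Outer 5-fold CV; the AutoTuner runs its own inner 5-fold tuning
# loop on each outer training set, giving nested CV
outer_resampling = rsmp("cv", folds = 5)
rr = resample(task, at, outer_resampling, store_models = TRUE)

rr$aggregate(msr("classif.logloss"))  # outer-loop performance estimate
extract_inner_tuning_results(rr)      # lambda chosen in each inner loop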

Am I doing something wrong? Any suggestions on improving speed here? Thanks!

  • If you want to tune the lambda parameter, the mlr3learners package also ships with the `classif.cv_glmnet` learner. The next version will include documentation to make this more visible. – Michel Aug 04 '21 at 09:46
  • Thank you. Is it possible to do stratified nested k-fold CV if cv.glmnet is embedded within another resample() of mlr3? I can manually specify the outer folds, but I would think cv.glmnet is called dynamically for each one, making it difficult to specify the inner ones manually? – Michael Connor Aug 05 '21 at 17:52
  • @Michael Connor when you use `cv.glmnet` with `mlr3` nested resampling, you are actually using triple-nested resampling: the innermost cv.glmnet would tune lambda (the regularization penalty), the middle loop would tune alpha (an AutoTuner tuning the ratio of lasso to ridge), and the outer loop would evaluate performance. – missuse Aug 05 '21 at 19:03
  • Thanks, I think I'm doing it right. If I wrap cv.glmnet with resample(), then that is double, i.e. nested, correct? But how do I specify the inner fold IDs? – Michael Connor Aug 05 '21 at 19:17
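
Update, prompted by the comments: a minimal sketch of the `classif.cv_glmnet` approach Michel suggests, with the task stratified on the target. My assumption is that the `stratum` column role applies only to mlr3's own resamplings, not to the internal folds that cv.glmnet draws for itself:

library(mlr3)
library(mlr3learners)

# stratify mlr3 resamplings on the target class
task$col_roles$stratum = "tox"

# classif.cv_glmnet wraps cv.glmnet, so lambda is tuned internally on
# each outer training set; no AutoTuner is needed for lambda alone
lrn_cv_glmnet = lrn("classif.cv_glmnet", alpha = 1, nfolds = 5,
                    predict_type = "prob")

rr = resample(task, lrn_cv_glmnet, rsmp("cv", folds = 5))
rr$aggregate(msr("classif.logloss"))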

0 Answers