
Goal

  1. Create a LASSO model using mlr3.
  2. Use nested CV: inner CV (or bootstraps) for hyperparameter (lambda) determination and outer CV for model performance evaluation (instead of just one train-test split), and find the standard deviation of the LASSO regression coefficients across the different model instances.
  3. Do a prediction on a testing data set that is not yet available.

Issues

  1. I am unsure whether the nested CV approach as described is implemented correctly in my code below.
  2. I am unsure whether alpha is set correctly, i.e. whether alpha = 1 stays fixed.
  3. I do not know how to access the LASSO coefficients when using resampling in mlr3. (importance() in mlr3learners does not yet support LASSO.)
  4. I don't know how to apply a possible model to the not-yet-available testing set in mlr3 (see the sketch after the code below).

Code

library(readr)
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(reprex)

# Data ------
# Prepared according to the Blog post by Julia Silge
# https://juliasilge.com/blog/lasso-the-office/
urlfile = 'https://raw.githubusercontent.com/shudras/office_data/master/office_data.csv'
data = read_csv(url(urlfile))[-1]
#> Warning: Missing column names filled in: 'X1' [1]
#> Parsed with column specification:
#> cols(
#>   .default = col_double()
#> )
#> See spec(...) for full column specifications.

# Add a factor to data
data$factor = as.factor(c(rep('a', 20), rep('b', 50), rep('c', 30), rep('a', 6), rep('c', 10), rep('b', 20)))

# Task creation
task = 
  TaskRegr$new(
    id = 'office', 
    backend = data, 
    target = 'imdb_rating'
  )

# Model creation
graph = 
  po('scale') %>>% 
  po('encode') %>>% # make factors numeric
  # How to normalize predictors, leaving target unchanged?
  lrn('regr.cv_glmnet', # 10-fold CV for inner loop. Is alpha permanently set to 1?
       id = 'rp', alpha = 1, family = 'gaussian'
     ) 

graph_learner = GraphLearner$new(graph)

# Execution (actual modeling)
result = 
  resample(
    task, 
    graph_learner, 
    rsmp('cv', folds = 5) # 5-fold for outer CV 
  )
#> INFO  [13:21:53.485] Applying learner 'scale.encode.regr.cv_glmnet' on task 'office' (iter 3/5) 
#> INFO  [13:21:54.937] Applying learner 'scale.encode.regr.cv_glmnet' on task 'office' (iter 2/5) 
#> INFO  [13:21:55.242] Applying learner 'scale.encode.regr.cv_glmnet' on task 'office' (iter 1/5) 
#> INFO  [13:21:55.500] Applying learner 'scale.encode.regr.cv_glmnet' on task 'office' (iter 4/5) 
#> INFO  [13:21:55.831] Applying learner 'scale.encode.regr.cv_glmnet' on task 'office' (iter 5/5)

# How to access results, i.e. the LASSO coefficients,
# and compare them (why no variable importance for glmnet)?

# Access prediction
result$prediction()
#> <PredictionRegr> for 136 observations:
#>     row_id truth response
#>          2   8.3 8.373798
#>          6   8.7 8.455151
#>          9   8.4 8.358964
#> ---                      
#>        116   9.7 8.457607
#>        119   8.2 8.130352
#>        128   7.8 8.224150

Created on 2020-06-11 by the reprex package (v0.3.0)
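
Regarding issue 4, my current understanding (cf. the comment by missuse below) is that nested CV serves evaluation only: one trains a single final model on all available training data and uses that to predict once the testing set arrives. A minimal sketch, where new_data is a placeholder name for a future data.frame with the same predictor columns as data:

# Fit one final model on all available data (not part of nested CV)
graph_learner$train(task)

# Once the held-back testing set is available:
# new_data = read_csv('...') # placeholder for the future testing set
prediction = graph_learner$predict_newdata(new_data)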

Edit 1 (LASSO coefficients)

According to a comment from missuse, the LASSO coefficients can be accessed through result$data$learner[[1]]$model$rp$model$glmnet.fit$beta. Additionally, I found that store_models = TRUE needs to be set in resample() to store the models and in turn access the coefficients.

  • Despite setting alpha = 1, I obtained multiple sets of LASSO coefficients. I would like the 'best' coefficients (stemming e.g. from lambda = lambda.min or lambda.1se). What do the different s0, s1, s2, ... mean? Are these different lambdas?
  • The different coefficients indeed seem to stem from different lambda values denoted s0, s1, s2, ... (the number is the index into the lambda sequence). I suppose the 'best' coefficients can be accessed by first finding the index of the 'best' lambda, index_lambda.1se = which(ft$lambda == ft$lambda.1se)[[1]]; index_lambda.min = which(ft$lambda == ft$lambda.min)[[1]], and then picking the corresponding column. A more concise approach is given in the comments by missuse; see also the sketch after the reprex below.
library(readr)
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(reprex)
urlfile = 'https://raw.githubusercontent.com/shudras/office_data/master/office_data.csv'
data = read_csv(url(urlfile))[-1]

# Add a factor to data
data$factor = as.factor(c(rep('a', 20), rep('b', 50), rep('c', 30), rep('a', 6), rep('c', 10), rep('b', 20)))

# Task creation
task = 
  TaskRegr$new(
    id = 'office', 
    backend = data, 
    target = 'imdb_rating'
  )

# Model creation
graph = 
  po('scale') %>>% 
  po('encode') %>>% # make factors numeric
  # How to normalize predictors, leaving target unchanged?
  lrn('regr.cv_glmnet', # 10-fold CV for inner loop. Is alpha permanently set to 1?
      id = 'rp', alpha = 1, family = 'gaussian'
  ) 

graph$keep_results = TRUE
graph_learner = GraphLearner$new(graph)

# Execution (actual modeling)
result = 
  resample(
    task, 
    graph_learner, 
    rsmp('cv', folds = 5), # 5-fold for outer CV 
    store_models = TRUE # Store models, needed to access coefficients
  )

# LASSO coefficients 
# Why more than one coefficient per predictor? 
# What are s1, s2 etc.? Shouldn't 'lrn' fix alpha = 1?
# How to obtain the best coefficients (for lambda.1se or lambda.min) if multiple?
as.matrix(result$data$learner[[1]]$model$rp$model$glmnet.fit$beta)
#>                  s0          s1         s2         s3          s4         s5
#> andy              0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> angela            0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> b_j_novak         0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> brent_forrester   0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> darryl            0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> dwight            0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> episode           0 0.000000000 0.00000000 0.00000000 0.010297763 0.02170423
#> erin              0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> gene_stupnitsky   0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> greg_daniels      0 0.000000000 0.00000000 0.00000000 0.001845101 0.01309437
#> jan               0 0.000000000 0.00000000 0.00000000 0.005663699 0.01357832
#> jeffrey_blitz     0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> jennifer_celotta  0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> jim               0 0.006331732 0.01761548 0.02789682 0.036853510 0.04590513
#> justin_spitzer    0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> [...]
#>                           s6           s7           s8          s9         s10
#> andy             0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> angela           0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> b_j_novak        0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> brent_forrester  0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> darryl           0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> dwight           0.002554576  0.007006995  0.011336058  0.01526851  0.01887180
#> episode          0.031963475  0.040864492  0.047487987  0.05356482  0.05910066
#> erin             0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> gene_stupnitsky  0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> greg_daniels     0.023040791  0.031866343  0.040170917  0.04779004  0.05472702
#> jan              0.021030152  0.028094541  0.035062678  0.04143812  0.04725379
#> jeffrey_blitz    0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> jennifer_celotta 0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> jim              0.053013058  0.058503984  0.062897112  0.06683734  0.07041964
#> justin_spitzer   0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> kelly            0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> ken_kwapis       0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> kevin            0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> lee_eisenberg    0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> michael          0.057190859  0.062963830  0.068766981  0.07394472  0.07865977
#> mindy_kaling     0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> oscar            0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> pam              0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> paul_feig        0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> paul_lieberstein 0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> phyllis          0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> randall_einhorn  0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> ryan             0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> season           0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> toby             0.000000000  0.000000000  0.005637169  0.01202893  0.01785309
#> factor.a         0.000000000 -0.003390125 -0.022365768 -0.03947047 -0.05505681
#> factor.b         0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#> factor.c         0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
#>                           s11          s12          s13           s14
#> andy              0.000000000  0.000000000  0.000000000  0.0000000000
#> angela            0.000000000  0.000000000  0.000000000  0.0000000000
#> b_j_novak         0.000000000  0.000000000  0.000000000  0.0000000000
#> brent_forrester   0.000000000  0.000000000  0.000000000  0.0000000000
#> darryl            0.000000000  0.000000000  0.000000000  0.0017042281
#> dwight            0.022170870  0.025326337  0.027880703  0.0303865693
#> episode           0.064126846  0.069018240  0.074399623  0.0794693480
#> [...]

Created on 2020-06-15 by the reprex package (v0.3.0)
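
For reference, the columns s0, s1, s2, ... of glmnet.fit$beta correspond one-to-one to the values of the regularization path stored in glmnet.fit$lambda; they are not separate tuning parameters. Following the comment by missuse below, a minimal sketch for extracting only the coefficients at the 'best' lambda from each outer-fold model (s = 'lambda.min' here; s = 'lambda.1se' works analogously):

# Coefficients at lambda.min: one sparse column vector per outer fold
best_coefs = lapply(
  seq_len(result$resampling$iters), # the 5 outer folds
  function(i) coef(result$data$learner[[i]]$model$rp$model, s = 'lambda.min')
)
best_coefs[[1]] # coefficients (incl. intercept) of the first fold's model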

Edit 2 (optional follow-up question)

Nested CV provides an evaluation of the discrepancy among multiple models. The discrepancy can be expressed as an error (e.g. RMSE) obtained by the outer CV. While that error may be small, the individual LASSO coefficients (importance of predictors) from the models (instantiated by the outer CV) may still vary considerably.

  1. Does mlr3 provide functionality describing the consistency in quantitative importance of the predictor variables, i.e. the spread (e.g. RMSE or standard deviation) of the LASSO coefficients among the models created by the outer CV? Or should a custom function be created, retrieving the LASSO coefficients using result$data$learner[[i]]$model$rp$model$glmnet.fit$beta (suggested by missuse) with i = 1, 2, 3, 4, 5 being the folds of the outer CV, and then taking the spread of the matching coefficients? (A sketch of such a function is given below.)
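
To my knowledge mlr3 has no built-in summary for this, so here is a minimal sketch of such a custom function. It reuses result from Edit 1 (fitted with store_models = TRUE), takes each fold's coefficients at lambda.min as suggested by missuse, and expresses the spread as a standard deviation per predictor:

# Coefficient matrix: one row per predictor (plus intercept), one column per outer fold
coef_matrix = sapply(
  seq_len(result$resampling$iters),
  function(i) as.matrix(coef(result$data$learner[[i]]$model$rp$model, s = 'lambda.min'))[, 1]
)

# Standard deviation of each coefficient across the outer-fold models
coef_sd = apply(coef_matrix, 1, sd)
sort(coef_sd, decreasing = TRUE) # predictors the folds disagree on most
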
  • 4. it is for this learner since it performs CV internally. 5. it is. 6. `result$data$learner[[1]]$model$rp$model$glmnet.fit$beta` for the first fold. 7. In general you do not want to predict using models built in nested CV, since you do not have one model but rather 5, since you used 5 folds in the outer resampling loop. You want to build one model on all of the train data and use that to predict on new data. Nested CV is used for evaluation mostly. – missuse Jun 15 '20 at 06:45
  • @missuse Edit 1: if ```alpha = 1``` is set correctly, why do I obtain different LASSO coefficients for the same predictors using ```result$data$learner[[1]]$model$rp$model$glmnet.fit$beta```? How would I select the coefficients stemming from the 'best' lambda (e.g. min or 1se)? What's the meaning of s1, s2 etc., different lambdas? (LASSO doesn't know an 's' parameter.) Edit 2: I added a concern about evaluating the model discrepancy obtained by the outer CV. – Shudras Jun 15 '20 at 12:04
  • this answer will be useful: https://stackoverflow.com/questions/47626830/using-glmnet-to-predict-a-continuous-variable-in-a-dataset/47627205#47627205, You can extract the coefficients at best lambda using: `lapply(1:5, function(i) coef(result$data$learner[[i]]$model$rp$model, s = "lambda.min"))`. Each time you fit a model on different data you will get a different model. Nested CV evaluates the variance in performance due to this. Sorry I did not mention `store_models = TRUE` previously. Glad to see you overcame it. – missuse Jun 15 '20 at 12:15
  • @Shudras It sounds like you solved your problem? It would be great if you could post an answer here. – pat-s Nov 06 '20 at 21:58
  • Note: the above method for accessing coefficients is very slow for me within lapply. Using the method per MLR3 documentation (https://mlr3.mlr-org.com/reference/ResampleResult.html) speeds it up tremendously: ```rrdt <- as.data.table(rr, reassemble_learners = TRUE, convert_predictions = TRUE, predict_sets = "test")``` then ```coef(rrdt$learner[[i]]$model$rp$model)``` – Michael Connor Aug 05 '21 at 06:08
