
I'm attempting to use LASSO for a slightly different purpose than it was originally designed for. I have 22 different tasks on a test which, when averaged, produce a final score. I want to see which combination of a limited number of tasks would best predict the overall score, with the hope of creating a short form of this test.

I'm using glmnet to run the lasso, and it runs as expected. I can then easily find the model at a given lambda with

coef(cvfit, s = s)

However, I am wondering if it would be possible to specify the number of predictors that have non-zero coefficients, rather than the penalization parameter itself.

I've set up a very inefficient way of doing this, shown below, by extracting the models over a grid of test lambdas, but I was wondering if there is a more efficient way of doing it.

lambdas <- (1:20000) / 20000 ## Grid of test lambdas

nvar <- numeric(length(lambdas))
coeffs <- list()

for (j in seq_along(lambdas)) {

  s <- lambdas[j]

  coeffs[[j]] <- coef(cvfit, s = s) ## Get coefficient list at the given lambda

  ## Count variables with non-zero coefficients; subtract one because the intercept is always non-zero
  nvar[j] <- sum(as.vector(coeffs[[j]]) != 0) - 1

}

getlamda <- function(numvar = 4) {

  ## Find the smallest lambda which resulted in the given number of non-zero coefficients
  min.idx <- min(which(nvar == numvar))

  coeffs[[min.idx]]

}
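So, for example, to pull the coefficients at the smallest lambda that keeps exactly four predictors:

getlamda(numvar = 4)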
jsizzle

2 Answers


You can use rowSums().

library(dplyr)

(boston <- MASS::Boston %>% tbl_df())
#> # A tibble: 506 x 14
#>       crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
#>  *   <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl>   <dbl>
#>  1 0.00632  18    2.31     0 0.538  6.58  65.2  4.09     1   296    15.3
#>  2 0.0273    0    7.07     0 0.469  6.42  78.9  4.97     2   242    17.8
#>  3 0.0273    0    7.07     0 0.469  7.18  61.1  4.97     2   242    17.8
#>  4 0.0324    0    2.18     0 0.458  7.00  45.8  6.06     3   222    18.7
#>  5 0.0690    0    2.18     0 0.458  7.15  54.2  6.06     3   222    18.7
#>  6 0.0298    0    2.18     0 0.458  6.43  58.7  6.06     3   222    18.7
#>  7 0.0883   12.5  7.87     0 0.524  6.01  66.6  5.56     5   311    15.2
#>  8 0.145    12.5  7.87     0 0.524  6.17  96.1  5.95     5   311    15.2
#>  9 0.211    12.5  7.87     0 0.524  5.63 100    6.08     5   311    15.2
#> 10 0.170    12.5  7.87     0 0.524  6.00  85.9  6.59     5   311    15.2
#> # ... with 496 more rows, and 3 more variables: black <dbl>, lstat <dbl>,
#> #   medv <dbl>

For the above dataset (Boston housing), consider the model medv ~ ., i.e. regress medv on all the other variables.

library(glmnet)
tr_x <- model.matrix(medv ~ ., data = boston)[,-1]
tr_y <- boston$medv
cvfit <- glmnet(tr_x, tr_y)

head(t(coef(cvfit)))
#> 6 x 14 sparse Matrix of class "dgCMatrix"
#>    [[ suppressing 14 column names '(Intercept)', 'crim', 'zn' ... ]]
#>                                                        
#> s0 22.53281 . . . . . .         . . . . . .  .         
#> s1 23.60072 . . . . . .         . . . . . . -0.08439977
#> s2 23.67264 . . . . . 0.1278413 . . . . . . -0.15358093
#> s3 21.44649 . . . . . 0.5694424 . . . . . . -0.19698136
#> s4 19.42057 . . . . . 0.9714620 . . . . . . -0.23654740
#> s5 17.57464 . . . . . 1.3377669 . . . . . . -0.27259852

I assume you have already gone through this procedure.


Remarks

  1. It might be convenient to transpose the coefficient matrix so that each variable becomes a column and each row corresponds to one lambda.
  2. Then rowSums(t(coef(cvfit)) != 0) counts the non-zero elements in each row, i.e. the number of non-zero coefficients at each lambda (see the sketch after this list).
  3. Next, we just match numvar against these row sums and extract the corresponding coefficients.
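
Putting remarks 1 and 2 together, a minimal sketch of the counting step with the cvfit fitted above:

nz_count <- rowSums(t(coef(cvfit))[, -1] != 0) ## non-zero coefficients at each lambda, intercept column dropped
head(nz_count)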

Note that from s0 to s5 the lambdas decrease: s0 has the largest lambda, so it is the most penalized.

head(cvfit$lambda)
#> [1] 6.777654 6.175546 5.626927 5.127046 4.671574 4.256564
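
As a quick check, you can verify that the stored sequence is decreasing, i.e. ordered from most to least penalized:

all(diff(cvfit$lambda) < 0) ## TRUE: each successive lambda is smaller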

Subset coef with numvar

Based on these facts,

get_nparam <- function(mod, numvar) {
  beta <- coef(mod)
  non_zero <- rowSums(t(beta)[,-1] != 0) # ignore intercept
  min_lam <- which(non_zero == numvar) # numvar non-zero coef
  t(beta)[dplyr::last(min_lam),] # last index = smallest lambda
}

With this function, you can get

get_nparam(cvfit, 4)
#>  (Intercept)         crim           zn        indus         chas 
#> 15.468034114  0.000000000  0.000000000  0.000000000  0.000000000 
#>          nox           rm          age          dis          rad 
#>  0.000000000  3.816165372  0.000000000  0.000000000  0.000000000 
#>          tax      ptratio        black        lstat 
#>  0.000000000 -0.606026131  0.001518042 -0.495954410

rm, ptratio, black, and lstat are non-zero, while the others are zero.
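
If you only need the names of the selected variables, a small follow-up sketch on the returned vector (param is just an illustrative name):

param <- get_nparam(cvfit, 4)
setdiff(names(param)[param != 0], "(Intercept)")
#> [1] "rm"      "ptratio" "black"   "lstat"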

younggeun
  • Thank you so much for the detailed reply with clear step-by-step instructions and especially the replicable example. I think I figured out an even simpler solution after playing with your code, which I will post below. – jsizzle Jan 10 '19 at 15:54

After playing with younggeun's solution above, I realized that there is an even simpler way of doing this.

Using the Boston dataset from the example above:

library(dplyr)
library(glmnet)

(boston <- MASS::Boston %>% tbl_df())

tr_x <- model.matrix(medv ~ ., data = boston)[,-1]
tr_y <- boston$medv
cvfit <- glmnet(tr_x, tr_y)

The cvfit object already has all the components we need to find the answer for a given number of variables: df is the number of non-zero coefficients (degrees of freedom) at each step, which is exactly the number of variables we are interested in, and lambda is the lambda for each model.
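
For instance, you can line these two components up to see which lambdas reach a given df (a quick sketch):

head(data.frame(df = cvfit$df, lambda = cvfit$lambda))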

So we can create a simple function that returns the best model for a given number of variables.

get_nparam <- function(mod, numvar) {

  ## Coefficients at the smallest lambda whose model has exactly numvar non-zero terms
  coef(mod, s = with(mod, min(lambda[df == numvar])))

}

get_nparam(cvfit, 4)

#14 x 1 sparse Matrix of class "dgCMatrix"
#                       1
#(Intercept) 15.468034114
#crim         .          
#zn           .          
#indus        .          
#chas         .          
#nox          .          
#rm           3.816165372
#age          .          
#dis          .          
#rad          .          
#tax          .          
#ptratio     -0.606026131
#black        0.001518042
#lstat       -0.495954410

Thanks again to younggeun for a different solution and for putting me on the path to this.

jsizzle