
As is commonly known, glmnet can be used as a tool for feature selection. A toy example:

library(glmnet)
# Binomial dataset, the number of classes is 2
data(BinomialExample)
# data truncation to 10 columns, just to make the example dataset smaller
x <- BinomialExample$x[,1:10] 
y <- BinomialExample$y
cvfit = cv.glmnet(x, y, family = "binomial")
coefs <- coef(cvfit)

The coefs variable shows which features have been selected (in this example, all features except V1 and V7). This result is clear and easy to interpret.

> coefs
11 x 1 sparse Matrix of class "dgCMatrix"
                    s1
(Intercept)  0.1048257
V1           .        
V2           0.5901863
V3          -0.4060696
V4          -0.9627180
V5          -0.1067188
V6          -0.7813739
V7           .        
V8          -0.4106554
V9           0.5733065
V10         -1.0492793
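
To turn the printout above into an explicit list of selected features, a minimal sketch (assuming the coefs object from the fit above) is:

# Keep the row names with nonzero coefficients, dropping the intercept.
m <- as.matrix(coefs)
selected <- setdiff(rownames(m)[m[, 1] != 0], "(Intercept)")
selected
# For the output shown above this gives V2-V6 and V8-V10.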

The problem is how to interpret the output results if the number of classes is more than two. A toy example:

# Multinomial, the number of classes is 3
data(MultinomialExample)
x <- MultinomialExample$x[,1:10] 
y <- MultinomialExample$y
cvfit = cv.glmnet(x, y, family = "multinomial")
coefs <- coef(cvfit)

Now coefs stores three proposed sets of features, one for each class.
Question: which set should be used as the best set of features?
In other words: is it possible to use glmnet as a feature selection tool at all when there are more than two classes?

> coefs
$`1`
11 x 1 sparse Matrix of class "dgCMatrix"
                      1
(Intercept) -0.03279324
V1           .         
V2          -0.08585827
V3           0.40882396
V4          -0.08639670
V5          -0.15763031
V6           0.22513768
V7           .         
V8           0.17657623
V9           .         
V10          .         

$`2`
11 x 1 sparse Matrix of class "dgCMatrix"
                      1
(Intercept)  0.01255996
V1          -0.21913800
V2           .         
V3           .         
V4           .         
V5           0.41329881
V6           .         
V7           .         
V8           .         
V9          -0.57131512
V10          0.52214739

$`3`
11 x 1 sparse Matrix of class "dgCMatrix"
                      1
(Intercept)  0.02023328
V1           0.09163282
V2           0.42655929
V3           .         
V4           0.29403632
V5           .         
V6          -0.12306560
V7           .         
V8          -0.44815059
V9           0.88580234
V10         -0.20920812
artgram

2 Answers


You need to use the grouped LASSO penalty by setting type.multinomial = "grouped". Then you will see the same zero/nonzero pattern in the coefficients for all classes.

library(glmnet)
data(MultinomialExample)  ## see "Note" below
#x <- MultinomialExample$x  ## see "Note" below
#y <- MultinomialExample$y  ## see "Note" below
cvfit <- cv.glmnet(x, y, family = "multinomial", type.multinomial = "grouped")
coef(cvfit)

Note:

The structure of the built-in data has changed between versions. I am using glmnet 4.1.1 and simply need

data(MultinomialExample)

But you are using glmnet 4.1.4 and need

data(MultinomialExample)
x <- MultinomialExample$x
y <- MultinomialExample$y
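
As a quick sanity check, here is a sketch (assuming the cvfit fitted with type.multinomial = "grouped" above) confirming that every class selects the same features:

coefs <- coef(cvfit)   # a list with one sparse coefficient matrix per class
sel <- lapply(coefs, function(m) {
  m <- as.matrix(m)
  setdiff(rownames(m)[m[, 1] != 0], "(Intercept)")
})
sel                        # the same feature names for every class
length(unique(sel)) == 1   # TRUE under the grouped penalty
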
Zheyuan Li

I don't know about the exact parameterisation used by glmnet, but multinomial regression can be represented by a set of binomial regressions, where the number of regressions depends on the number of classes.

cv.glmnet applies regularisation (Lasso regularisation in your example) to perform what you refer to as "feature selection" for each of these individual regressions, which is why you obtain three sparse matrices of regularised coefficients, one per model, with some coefficients set to zero.

Think of it this way: the selected features may vary across regressions because a feature may have predictive power for some classes but not for others, in which case you don't want it in the respective regression. Regularisation helps with that. In this regard, the procedure does not look for a "global" best set of regressors. Other types of regularisation take a different route; see this post on stats.stackexchange.
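
For instance, with the ungrouped fit from the question, a short sketch (assuming the list coefs computed there) makes the per-class differences visible:

# Feature-by-class table of nonzero coefficients; TRUE means the feature
# is selected in that class's regression.
sel_mat <- sapply(coefs, function(m) as.matrix(m)[, 1] != 0)
sel_mat <- sel_mat[rownames(sel_mat) != "(Intercept)", ]
sel_mat
# The columns (classes) show different patterns, which is why no single
# "best" set of features falls out of this fit.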

Martin C. Arnold