
I have a dataset with both continuous and categorical variables. I am running regression to predict one of the variables based on the other variables in the dataset. After comparing the results of ridge, lasso, and elastic-net regression, lasso is the best model to proceed with.

I used the `coef()` function to extract the model's coefficients; however, the result is a very long list with over 800 variables (as some of my categorical variables have many levels). Is there a way I can quickly rank the coefficients from largest to smallest? This is a glmnet model output.

Reproducible problem with example code:

# Libraries Needed
library(caret)
library(glmnet)
library(mlbench)
library(psych)

# Data
data("BostonHousing")
data <- BostonHousing
str(data)

# Data Partition
set.seed(222)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))
train <- data[ind==1,]
test <- data[ind==2,]

# Custom Control Parameters
custom <- trainControl(method = "repeatedcv",
                       number = 10,
                       repeats = 5,
                       verboseIter = TRUE)

# Linear Model
set.seed(1234)
lm <- train(medv ~.,
            train,
            method='lm',
            trControl = custom)

# Results
lm$results
lm
summary(lm)
plot(lm$finalModel)

# Ridge Regression
set.seed(1234)
ridge <- train(medv ~.,
               train,
               method = 'glmnet',
               tuneGrid = expand.grid(alpha = 0,
                                      lambda = seq(0.0001, 1, length = 5)), # try 5 values for lambda between 0.0001 and 1
               trControl = custom)
# increasing lambda increases the penalty, and vice versa;
# a larger lambda therefore shrinks the coefficients toward zero
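As a quick aside (not part of the original workflow): a minimal sketch of that shrinkage effect, fitting glmnet directly on the same training data and comparing coefficients at a mild versus a heavy penalty (the `x`/`y` names and the two `s` values are just for illustration):

x <- model.matrix(medv ~ ., train)[, -1] # predictor matrix, intercept column dropped
y <- train$medv
shrink_demo <- glmnet(x, y, alpha = 0)   # ridge path over glmnet's default lambda grid
coef(shrink_demo, s = 0.01)              # mild penalty: coefficients near their OLS values
coef(shrink_demo, s = 10)                # heavy penalty: visibly shrunken coefficients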

# Plot Results
plot(ridge)
plot(ridge$finalModel, xvar = "lambda", label = TRUE)
plot(ridge$finalModel, xvar = "dev", label = TRUE)
plot(varImp(ridge, scale = TRUE))

# Lasso Regression
set.seed(1234)
lasso <- train(medv ~.,
               train,
               method = 'glmnet',
               tuneGrid = expand.grid(alpha=1,
                                      lambda = seq(0.0001,1, length=5)),
               trControl = custom)

# Plot Results
plot(lasso)
lasso
plot(lasso$finalModel, xvar = "lambda", label = TRUE)
plot(lasso$finalModel, xvar = "dev", label = TRUE)
plot(varImp(lasso, scale = TRUE))

# Elastic Net Regression
set.seed(1234)
en <- train(medv ~.,
            train,
            method = 'glmnet',
            tuneGrid = expand.grid(alpha = seq(0,1,length=10),
                                   lambda = seq(0.0001,1,length=5)),
            trControl = custom)

# Plot Results
plot(en)
plot(en$finalModel, xvar = "lambda", label = TRUE)
plot(en$finalModel, xvar = "dev", label = TRUE)
plot(varImp(en))

# Compare Models
model_list <- list(LinearModel = lm, Ridge = ridge, Lasso = lasso, ElasticNet=en)
res <- resamples(model_list)
summary(res)
bwplot(res)
xyplot(res, metric = 'RMSE')

# Best Model
en$bestTune
best <- en$finalModel
coef(best, s = en$bestTune$lambda)
– dMh
1 Answer


For most models, all you'd have to do is:

sort(coef(model), decreasing=TRUE)
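For instance, with an ordinary `lm` fit, `coef()` returns a plain named numeric vector, so this works directly (a quick illustration on a built-in dataset):

fit <- lm(mpg ~ ., data = mtcars)   # ordinary least squares on built-in data
sort(coef(fit), decreasing = TRUE)  # named vector, so sorting keeps the labels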

Since you're using glmnet it's a little bit more complicated. I'm going to replicate a minimal version of your example here (the other models, plots, etc. aren't necessary to reproduce your problem ...)

## Packages
library(caret)
library(glmnet)
library(mlbench) ## for BostonHousing data
# Data
data("BostonHousing")
data <- BostonHousing
# Data Partition
set.seed(222)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))
train <- data[ind==1,]
test <- data[ind==2,]
# Custom Control Parameters
custom <- trainControl(method = "repeatedcv",
                       number = 10,
                       repeats = 5,
                       verboseIter = TRUE)
# Elastic Net Regression
set.seed(1234)
en <- train(medv ~.,
            train,
            method = 'glmnet',
            tuneGrid = expand.grid(alpha = seq(0,1,length=10),
                                   lambda = seq(0.0001,1,length=5)),
            trControl = custom)
# Best Model
best <- en$finalModel
coefs <- coef(best, s = en$bestTune$lambda)

(This could probably be made simpler: for example, do you really need the custom control parameters for this example? It would be even simpler without caret, just using `glmnet` - but I was afraid I might leave something out.)
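For what it's worth, a caret-free sketch along those lines (my guess at the simpler version: `cv.glmnet` picks lambda by cross-validation, and `x`/`y` are built from the same training data):

x <- model.matrix(medv ~ ., train)[, -1]  # dummy-coded predictor matrix
y <- train$medv
cvfit <- cv.glmnet(x, y, alpha = 1)       # lasso with 10-fold cross-validation
coef(cvfit, s = "lambda.min")             # coefficients at the CV-chosen lambda

Either way you end up with the same sparse-matrix shape; the rest of this answer uses `coefs` from the caret fit above.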

Once you've got the coefficients, sorting does appear to work, albeit with a message about possible inefficiency:

sort(coefs, decreasing=TRUE)
## <sparse>[ <logic> ] : .M.sub.i.logical() maybe inefficient
##  [1]  25.191049410   5.078589706   1.389548822   0.244605193   0.045600250
##  [6]   0.008840485   0.004372752  -0.012701593  -0.028337745  -0.162794401
## [11]  -0.335062819  -0.901475516  -1.395091095 -12.632336419

`sort(as.numeric(coefs))` also appears to work fine, but note that `as.numeric()` drops the variable names.
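If you want the names to survive the sort (so you can see which coefficients are largest), one option is to densify first; a minimal sketch:

cc <- as.matrix(coefs)             # dense one-column matrix; row names are kept
sort(cc[, 1], decreasing = TRUE)   # cc[, 1] is a named vector, so the sort keeps labels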

If you want to sort the entire matrix (i.e. keeping the values for all penalization levels), you can take advantage of the fact that the penalization doesn't change the rank-order of the parameters:

coeftab <- coef(best)
lastvals <- coeftab[, ncol(coeftab)]
coeftab_s <- coeftab[order(lastvals, decreasing = TRUE), ]
## plot, leaving out the intercept
matplot(t(coeftab_s)[, -1], type = "l")
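With 800+ dummy-coded coefficients you may also want to rank by absolute magnitude rather than raw value; a small sketch reusing `lastvals` from above (the cutoff of 10 is arbitrary):

abs_ord <- order(abs(lastvals), decreasing = TRUE) # rank by |coefficient|
head(lastvals[abs_ord], 10)                        # ten largest in magnitude (intercept included; drop it if unwanted)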
– Ben Bolker
  • my coefficients are stored in a dgCMatrix so the above is not working for me: [1] "dgCMatrix" attr(,"package") [1] "Matrix" – dMh May 31 '20 at 19:43
  • that's the kind of information you should show us in a [mcve] ... can you let us know at the very least what `str()` of your coefficients looks like? (What package/code are you using that returns coefficients in this format?) – Ben Bolker May 31 '20 at 19:45
  • is this by any chance a `glmnet` model output? https://stackoverflow.com/questions/47958170/coerce-model-coefficients-to-clean-2-column-dataframe (a sketch along those lines follows this thread) – Ben Bolker May 31 '20 at 19:50
  • `str()` of coefficients (not able to showcase all with character cap): `'data.frame': 9203 obs. of 36 variables: $ Area_of_Origin : Factor w/ 72 levels "11 - Lobby, Entranceway",..: 7 25 9 7 9 51 45 45 45 7 ... $ Building_Status : Factor w/ 8 levels "","01 - Normal (no change)",..: 2 7 2 2 3 2 2 2 2 2 ...` Yes, this is a glmnet model output. I am using lasso regression – dMh May 31 '20 at 19:50
  • Can you edit your question to include this? Also, **please** tell us more about the modeling framework you're using. (What you're describing here looks more like input data than coefficient output ...) – Ben Bolker May 31 '20 at 19:52
  • I think I know how to answer this, but I'll wait for you to edit your question with more description [i.e. that you're using `glmnet`] and a [mcve]. (Hint: the very first example in `?glmnet` would make a fine, simple example in this case ...) – Ben Bolker May 31 '20 at 20:00
  • Absolutely. As described in the question: I have a dataset with both continuous and categorical variables, and after comparing ridge, lasso, and elastic-net regression, lasso is the model to proceed with. I have now been able to obtain the coefficients, but the list is very long because my categorical variables have so many levels – dMh May 31 '20 at 20:00
  • Please **edit your question to include this information** (and the reproducible example) - thanks. – Ben Bolker May 31 '20 at 20:02
  • I've given this a shot. If this doesn't answer your question please edit your question to try to clarify (comments if necessary, but good question-edits are preferred ...) – Ben Bolker May 31 '20 at 20:42
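Following up on the linked question about coercing the coefficients to a clean two-column data frame: a minimal sketch, assuming `coefs` is the dgCMatrix returned by `coef()` in the answer above:

coef_df <- data.frame(term = rownames(coefs),         # variable names from the sparse matrix
                      estimate = as.numeric(coefs))   # flatten the one-column matrix
coef_df[order(coef_df$estimate, decreasing = TRUE), ] # ranked largest to smallest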