
I am a new R user and I'm fitting a multinomial regression (i.e. a logistic regression whose response variable has more than 2 classes) with the vglm function in R. My dataset has 11 continuous predictors and 1 categorical response variable with 3 classes.

I want to find the best subset of predictors for my regression, but I don't know how to do it. Is there a function for this, or must I do it manually? The functions for linear models don't seem suitable.

I have tried the bestglm function, but its results don't seem suitable for a multinomial regression.

I have also tried a shrinkage method, glmnet, which implements the lasso. It keeps all the variables in the model, yet the multinomial regression fit with vglm reports some variables as insignificant.
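For reference, my glmnet attempt looked roughly like this (a sketch; x.mat and y.class are placeholder names for my actual predictor matrix and 3-class response):

library(glmnet)

# cross-validated multinomial lasso; x.mat is the n x 11 numeric predictor
# matrix and y.class the 3-level factor response (placeholder names)
cv.fit <- cv.glmnet(x.mat, y.class, family = "multinomial")
coef(cv.fit, s = "lambda.min")  # nonzero rows = variables kept at that lambda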

I've searched a lot on the Internet, including this website, but haven't found a good answer. So I'm asking here because I really need help with this. Thanks

Hanie Bk
  • Most, if not all, of those modeling functions can handle multinomial models, but if you are new to R, it is possible that there is a problem with how the data are assembled or with your model specification/interpretation. For example, it's possible that `glmnet` says that the best model uses all of the predictors, but more likely you are not interpreting the results at a reasonable lambda. You should talk to a statistician, because there is a lot to understand in model selection and there are many pitfalls. Even if your R code is correct, model selection is not a topic for statistics beginners. – David Dec 12 '16 at 05:53

1 Answer


There are a few basic steps to get what you want:

  • define a model grid of all potential predictor combinations
  • fit a model for each combination of predictors
  • use one criterion (or a set of criteria) to select the best subset of predictors

The model grid can be defined with the following function:

# define model grid for best subset regression
# defines which predictors are on/off; all combinations presented
model.grid <- function(n){
     n.list <- rep(list(0:1), n)
     expand.grid(n.list)
}

For example, with 4 variables we get 2^n = 2^4 = 16 combinations. A value of 1 indicates the predictor is on and a value of 0 indicates it is off:

model.grid(4)
   Var1 Var2 Var3 Var4
1     0    0    0    0
2     1    0    0    0
3     0    1    0    0
4     1    1    0    0
5     0    0    1    0
6     1    0    1    0
7     0    1    1    0
8     1    1    1    0
9     0    0    0    1
10    1    0    0    1
11    0    1    0    1
12    1    1    0    1
13    0    0    1    1
14    1    0    1    1
15    0    1    1    1
16    1    1    1    1

I provide another function below that runs all model combinations. It also builds a sorted data frame that ranks the different model fits using 5 criteria. The predictor combination at the top of the table is the "best" subset given the training data and the predictors supplied:

# function for best subset regression
# ranks predictor combos using 5 selection criteria

best.subset <- function(y, x.vars, data){
     # y       character string naming the dependent variable
     # x.vars  character vector with the names of the predictors
     # data    training data containing the y and x.vars observations

     require(dplyr)
     require(purrr)
     require(magrittr)
     require(forecast)

     length(x.vars) %>%
          model.grid %>%
          apply(1, function(x) which(x > 0, arr.ind = TRUE)) %>%
          map(function(x) x.vars[x]) %>%
          .[2:dim(model.grid(length(x.vars)))[1]] %>%
          map(function(x) tslm(paste0(y, " ~ ", paste(x, collapse = "+")), data = data)) %>%
          map(function(x) CV(x)) %>%
          do.call(rbind, .) %>%
          cbind(model.grid(length(x.vars))[-1, ], .) %>%
          arrange(., AICc)
}

You'll see that the tslm() function is used for the model fits; others, such as vglm(), could be swapped in. Simply substitute the model function you want, as sketched below.
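For instance, here is a minimal, untested sketch of a multinomial variant built on VGAM's vglm(). Since forecast's CV() only applies to linear models, the sketch ranks subsets on AIC alone; it reuses model.grid() from above and assumes data is a data frame containing y and x.vars:

# hypothetical multinomial variant of best.subset(); ranks subsets by AIC
best.subset.vglm <- function(y, x.vars, data){
     require(VGAM)

     grid <- model.grid(length(x.vars))[-1, ]   # drop the all-off combination
     aics <- apply(grid, 1, function(row){
          rhs <- paste(x.vars[row == 1], collapse = " + ")
          fit <- vglm(as.formula(paste(y, "~", rhs)),
                      family = multinomial, data = data)
          AIC(fit)
     })
     res <- cbind(grid, AIC = aics)
     res[order(res$AIC), ]
}

One practical caution: with the 11 predictors in the question, the grid holds 2^11 - 1 = 2047 candidate models, so expect a noticeably longer run time.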

The function requires 4 installed packages. It configures the data and uses purrr's map() function to iterate across all model combinations (i.e. no for loop). The forecast package then supplies the cross-validation function CV(), which computes the 5 metrics, or selection criteria, used to rank the predictor subsets.
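To see what each iteration produces, you can call CV() on a single fit; it returns a named vector of those five criteria (an illustrative call using the uschange data loaded in the example below):

library(forecast)
library(fpp2)   # provides the uschange data

fit <- tslm(Consumption ~ Income + Savings, data = uschange)
CV(fit)   # named vector: CV, AIC, AICc, BIC, AdjR2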

Here is an application example lifted from the book "Forecasting: Principles and Practice". The example also uses data from the book, found in the fpp2 package.

library(fpp2)

# test the function
y <- "Consumption"
x.vars <- c("Income", "Production", "Unemployment", "Savings")

best.subset(y, x.vars, uschange) 

The resulting table, sorted on the AICc metric, is shown below. The best subset minimizes CV, AIC, AICc, and BIC, maximizes adjusted R-squared, and is found at the top of the list:

   Var1 Var2 Var3 Var4     CV    AIC   AICc    BIC   AdjR2
1     1    1    1    1 0.1163 -409.3 -408.8 -389.9 0.74859
2     1    0    1    1 0.1160 -408.1 -407.8 -391.9 0.74564
3     1    1    0    1 0.1179 -407.5 -407.1 -391.3 0.74478
4     1    0    0    1 0.1287 -388.7 -388.5 -375.8 0.71640
5     1    1    1    0 0.2777 -243.2 -242.8 -227.0 0.38554
6     1    0    1    0 0.2831 -237.9 -237.7 -225.0 0.36477
7     1    1    0    0 0.2886 -236.1 -235.9 -223.2 0.35862
8     0    1    1    1 0.2927 -234.4 -234.0 -218.2 0.35597
9     0    1    0    1 0.3002 -228.9 -228.7 -216.0 0.33350
10    0    1    1    0 0.3028 -226.3 -226.1 -213.4 0.32401
11    0    0    1    1 0.3058 -224.6 -224.4 -211.7 0.31775
12    0    1    0    0 0.3137 -219.6 -219.5 -209.9 0.29576
13    0    0    1    0 0.3138 -217.7 -217.5 -208.0 0.28838
14    1    0    0    0 0.3722 -185.4 -185.3 -175.7 0.15448
15    0    0    0    1 0.4138 -164.1 -164.0 -154.4 0.05246

Only 15 predictor combinations are profiled in the output, since the combination with all predictors off has been dropped. Looking at the table, the best subset is the one with all predictors on. However, the second row uses only 3 of the 4 variables with roughly the same performance. Also note that after row 4 the results begin to degrade. That's because Income and Savings appear to be the key drivers of Consumption: as these two variables are dropped from the predictors, model performance falls off sharply.

The custom function appears sound, since the results presented here match those of the referenced book.

A good day to you.

Brad Horn