
Using the R MASS package to do a linear discriminant analysis, is there a way to get a measure of variable importance?

library(MASS)
### import data and do some preprocessing
fit <- lda(cat ~ ., data = train)

I have a data set with about 20 measurements used to predict a binary category. But the measurements are hard to obtain, so I want to reduce them to the most influential ones.

When using rpart or randomForest I can get a list of variable importances, or a Gini decrease statistic, using summary() or importance().

Is there a built-in function to do this that I can't find? Or, if I have to code one, what would be a good way to go about it?

Bach
Nick
  • You should know the statistical method well enough to know if such a measure exists. If you need help extracting it in R, that's more of a programming question. If your question is about the theory of a statistical method, it probably belongs on http://stats.stackexchange.com – MrFlick May 28 '14 at 01:57

2 Answers


I would recommend using the caret package.

library(caret)
data(mdrr)
## drop near-zero-variance and highly correlated predictors
mdrrDescr <- mdrrDescr[, -nearZeroVar(mdrrDescr)]
mdrrDescr <- mdrrDescr[, -findCorrelation(cor(mdrrDescr), .8)]

set.seed(1)
inTrain <- createDataPartition(mdrrClass, p = .75, list = FALSE)[,1]
train <- mdrrDescr[ inTrain, ]
test  <- mdrrDescr[-inTrain, ]
trainClass <- mdrrClass[ inTrain]
testClass  <- mdrrClass[-inTrain]

## recursive feature elimination with LDA, assessed by cross-validation
set.seed(2)
ldaProfile <- rfe(train, trainClass,
                  sizes = c(1:10, 15, 30),
                  rfeControl = rfeControl(functions = ldaFuncs, method = "cv"))

postResample(predict(ldaProfile, test), testClass)

Once the variable ldaProfile is created, you can retrieve the best subset of variables:

ldaProfile$optVariables
[1] "X5v"    "VRA1"   "D.Dr06" "Wap"    "G1"     "Jhetm"  "QXXm"  
[8] "nAB"    "H3D"    "nR06"   "TI2"    "nBnz"   "Xt"     "VEA1"  
[15] "TIE"

You can also get a nice plot of the number of variables used vs. accuracy.
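For example, assuming ldaProfile has been fitted as above, caret's plot method for rfe objects shows resampled accuracy against subset size:

```r
## accuracy vs. number of variables retained (grid + points/lines)
plot(ldaProfile, type = c("g", "o"))
```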

Esteban PS
    I was looking for something similar, stumbled across this post, dug a little, and found out that using ldaFuncs calls caret::filterVarImp which computes importances based on ROC AUC computed using cutoffs in each feature separately (which is nifty). Regardless though, using rfe with ldaFuncs doesn't derive some measure of feature importance directly from the LDA models themselves (I don't believe that's possible). – Eric Czech Feb 18 '16 at 02:01

One option would be to employ permutation importance.

Fit the LDA model, then, one feature at a time, randomly shuffle (permute) that feature's column and compare the resulting prediction score with the baseline (non-permuted) score.

The more the permuted score drops relative to the baseline, the more important that feature is. You can then select a cutoff and keep only those features for which the baseline score minus the permuted score is above the threshold.

There is a nice tutorial on Kaggle for this topic. It uses Python instead of R, but the concept is directly applicable here.

https://www.kaggle.com/dansbecker/permutation-importance
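The procedure above can be sketched in R with a small helper (the function name perm_importance is hypothetical; it assumes a data frame train whose binary outcome column is cat, as in the question):

```r
library(MASS)

## Permutation importance for an LDA model.
## Assumes `train` is a data frame with a factor outcome column (default "cat").
perm_importance <- function(train, outcome = "cat", n_rep = 10) {
  fit <- lda(reformulate(".", response = outcome), data = train)
  X <- train[, setdiff(names(train), outcome), drop = FALSE]
  y <- train[[outcome]]
  ## baseline accuracy on the (unshuffled) data
  baseline <- mean(predict(fit, X)$class == y)
  imp <- sapply(names(X), function(v) {
    drops <- replicate(n_rep, {
      Xp <- X
      Xp[[v]] <- sample(Xp[[v]])   # shuffle one column, leave the rest intact
      baseline - mean(predict(fit, Xp)$class == y)
    })
    mean(drops)                    # average accuracy drop over n_rep shuffles
  })
  sort(imp, decreasing = TRUE)     # larger drop = more important feature
}
```

Features whose shuffling barely reduces accuracy can then be dropped; averaging over several shuffles (n_rep) stabilizes the estimate.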

wmsmith