
I am fairly new to multivariate statistics and cannot find the answer in the R help pages or in the source of the MASS package, so maybe you can help me.

My data has many predictors (450) and just a few observations (~200). I read that it is not possible to compute an LDA in this case because of the necessary inversion of the covariance matrix. But trying it out before knowing this showed that it works and gives fairly good results. How can that be explained? Does LDA select the variables with the highest separating power beforehand?

I'm using the caret package to add a 5-fold CV and to split the data beforehand into train (0.8) and test (0.2) sets:

Validierung <- trainControl(method = "cv", number = 5)
ldaFit1 <- train(Species ~ ., data = train,
                 method = "lda",
                 trControl = Validierung,
                 metric = "Accuracy")
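For completeness, a minimal sketch of the stratified 80/20 split described above (the data frame name `dat` is hypothetical; `createDataPartition()` keeps the class proportions in both parts):

library(caret)

set.seed(42)  # for a reproducible split
idx   <- createDataPartition(dat$Species, p = 0.8, list = FALSE)
train <- dat[idx, ]   # 80 % training data
test  <- dat[-idx, ]  # 20 % held-out test data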

1 Answer


LDA has an internal mechanism to reduce the number of features into a few important latent variables:

Like PCA, LDA uses linear combinations of the predictors to create new axes, which are then used for the final classification. Unlike PCA, LDA tries to maximize the differences between the groups, whereas PCA ignores the class labels and maximizes the total variance instead.
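To make the contrast concrete, here is a small sketch (not part of the original answer) that fits both decompositions on iris and extracts their low-dimensional scores:

library(MASS)

# PCA: unsupervised, ignores Species, maximizes total variance
pca <- prcomp(iris[, 1:4], scale. = TRUE)
head(pca$x[, 1:2])    # scores on the first two principal components

# LDA: supervised, maximizes between-group separation relative to
# the within-group variance
fit <- lda(Species ~ ., data = iris)
head(predict(fit)$x)  # scores on the discriminant axes LD1 and LD2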

Furthermore, a variable is treated as constant within groups, and rejected, if its within-group variance is below a tolerance threshold (argument `tol` in `MASS::lda`).
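The tolerance check can be triggered on purpose, as in this sketch (the constant column is made up): a variable with no within-group variance makes lda() refuse to fit.

library(MASS)

iris2 <- iris
iris2$noise <- 1  # constant column, zero within-group variance
lda(Species ~ ., data = iris2)
# expected to stop with an error along the lines of
# "variable 5 appears to be constant within groups"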

The features are weighted by multiplying the (centered) raw data with the scaling coefficient matrix to get the data into the LDA-transformed space; a sketch verifying this projection follows the output below. Petal.Width is the most useful feature to discriminate between the species (it has the highest absolute value in the LD1 column of the scaling matrix), and the second discriminant axis is almost not important at all (proportion of trace: 0.0088):

library(MASS)

model <- lda(Species ~ ., iris)
model
#> Call:
#> lda(Species ~ ., data = iris)
#> 
#> Prior probabilities of groups:
#>     setosa versicolor  virginica 
#>  0.3333333  0.3333333  0.3333333 
#> 
#> Group means:
#>            Sepal.Length Sepal.Width Petal.Length Petal.Width
#> setosa            5.006       3.428        1.462       0.246
#> versicolor        5.936       2.770        4.260       1.326
#> virginica         6.588       2.974        5.552       2.026
#> 
#> Coefficients of linear discriminants:
#>                     LD1         LD2
#> Sepal.Length  0.8293776  0.02410215
#> Sepal.Width   1.5344731  2.16452123
#> Petal.Length -2.2012117 -0.93192121
#> Petal.Width  -2.8104603  2.83918785
#> 
#> Proportion of trace:
#>    LD1    LD2 
#> 0.9912 0.0088
model$scaling
#>                     LD1         LD2
#> Sepal.Length  0.8293776  0.02410215
#> Sepal.Width   1.5344731  2.16452123
#> Petal.Length -2.2012117 -0.93192121
#> Petal.Width  -2.8104603  2.83918785

Created on 2021-10-04 by the reprex package (v2.0.1)
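To verify the projection described above (a sketch, not part of the original reprex): predict() centers the data at the prior-weighted group means and then applies the scaling matrix, which can be reproduced by hand with the fit from above.

ctr  <- colSums(model$prior * model$means)  # prior-weighted overall center
proj <- scale(as.matrix(iris[, 1:4]), center = ctr, scale = FALSE) %*% model$scaling
all.equal(unname(proj), unname(predict(model)$x))  # should be TRUE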

  • Thank you! This helps a lot. Do you have a reference for this to cite? – bienexo Oct 04 '21 at 06:01
  • The threshold is part of the MASS documentation. LDA is a generic method, so we do not have to cite papers here. – danlooo Oct 04 '21 at 06:08
  • I agree! But I would like to know more about the "internal mechanism to reduce the number of features into a few latent variables". Thank you in advance! :) – bienexo Oct 04 '21 at 07:09
  • LDA is explained e.g. in https://www.youtube.com/watch?v=azXCzI57Yfc – danlooo Oct 04 '21 at 07:14
  • @bienexo I revised my answer. – danlooo Oct 04 '21 at 07:27
  • Do you also know why the scaling still gives results for all of my features, even though only some were selected? – bienexo Oct 04 '21 at 09:55
  • It's like PCA: n features give you n axes, but you can ignore most of them because they are not important. – danlooo Oct 04 '21 at 09:59
  • Ah okay, thanks! Is there a way to extract the relevant features? – bienexo Oct 04 '21 at 10:12
  • @bienexo I revised my answer. You want to look at the scaling coefficients – danlooo Oct 04 '21 at 11:21
  • It's me again, sorry. I was checking the code of lda in MASS to see how the value that is compared to the tolerance is calculated, and I found out my variables are not getting reduced by this if clause. While going through the calculation step by step, it gives me an error about non-conformable arguments, but as a whole it works... Do you have an idea why? – bienexo Oct 06 '21 at 06:19
  • You get non-conformable arguments, e.g., because in matrix multiplication the number of columns of the left matrix must equal the number of rows of the right matrix: `matrix(c(1,2,3)) %*% matrix(c(1,2,3))` vs `matrix(c(1,2,3)) %*% t(matrix(c(1,2,3)))` – danlooo Oct 06 '21 at 07:30
  • Got it! But why does the `lda()` function itself work then? Should it not raise the same error? – bienexo Oct 07 '21 at 07:59
  • To answer this, one needs to look into the source code. You can run `traceback()` after the error to see which function is causing trouble. – danlooo Oct 07 '21 at 08:09
  • It says the variables are collinear, but other than that there is no error when performing `lda()`. I am even getting pretty good results out of it. That is why I am wondering whether those results are trustworthy or not... – bienexo Oct 07 '21 at 11:28
  • Collinearity refers to a high correlation between many of your covariates. This means that most of them are redundant, because they behave pretty much the same, and that it is easy to represent your points in lower dimensions. For more details about the validity, please ask a question on the Cross Validated platform; in this case it might matter how your actual data is distributed. – danlooo Oct 07 '21 at 12:01
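The collinearity warning from the last comments can be reproduced with a sketch (using a duplicated iris column instead of the asker's data): a perfectly redundant predictor lowers the rank of the within-group data, so lda() warns but still returns a fit.

library(MASS)

iris3 <- iris
iris3$Petal.Length2 <- iris3$Petal.Length  # perfectly collinear copy
fit <- lda(Species ~ ., data = iris3)      # warns: "variables are collinear"
# the redundant direction carries no extra information, so the fit still works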