I am attempting to perform a linear discriminant analysis (LDA) on geometric morphometric data. Because geometric morphometric data typically produce large numbers of variables, and discriminant analyses require more data points than variables to classify specimens accurately, a common solution in the literature is to perform a principal component analysis (PCA) first. The input for the LDA is then the number of PCs that returns the highest reclassification rate, chosen from among the PCs that together represent less than 99% of the cumulative variance.

Right now I am doing this by running the LDA in R (using functions from the Morpho and MASS packages) with every possible number of PCs and noting the classification accuracy by hand until I find the lowest number of PCs that returns the highest accuracy, but this is highly inefficient.

Is there a way to write a function that runs an LDA for each possible number of the first N PCs (up to a user-defined level, such as the number representing 99% of the cumulative variance) and returns the percent reclassification rate for each, producing something like the following:

PCs percent_accuracy
20  72.2
19  76.3
18  77.4
17  80.1
16  75.4
15  50.7
... ...
1   20.2

So row 1 would be the reclassification rate when the first 20 PCs are used, row 2 is the rate when the first 19 PCs are used, and so on and so forth.

user2352714
  • This is a bad idea. It guarantees that you will over-determine your model. That means that your model will predict your original data very well, but do substantially worse predicting new data. You need to spend some time with the literature on [statistical learning](http://faculty.marshall.usc.edu/gareth-james/ISL/) before you continue. At least you should use cross-validation on each run. Technically this is simple. Write a function that runs the analysis and computes the percent accuracy. Then use that function in a loop (or with `sapply`) to get the result for each number of PCs. – dcarlson Feb 22 '20 at 05:13
  • The issue is PCAs on morphometric data produce dozens of PCs (my dataset has 94 and it is small by those standards), most of which contribute little to variance and result in having more variables than data points. This in turn results in more degrees of freedom in the measurements than the specimens. [This method has been used in previous studies](https://frontiersinzoology.biomedcentral.com/articles/10.1186/1742-9994-3-15) in this field to reduce the number of PCs to avoid overfitting or violating the assumptions of a discriminant analysis. I have been using cross-validation for each run. – user2352714 Feb 22 '20 at 06:16
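
The approach suggested in the comments (a function that runs the analysis and computes percent accuracy, applied with `sapply` over each number of PCs) could be sketched roughly as follows. This is only a minimal illustration, not a tested morphometrics workflow: the helper name `lda_accuracy_by_pcs` and its arguments are hypothetical, it uses `MASS::lda` with `CV = TRUE` (leave-one-out cross-validation) rather than any Morpho function, and the built-in `iris` data stands in for real PC scores. The 99% cutoff here is taken as the smallest number of PCs reaching that share of variance, which is an assumption and may need adjusting.

```r
library(MASS)  # provides lda()

# Hypothetical helper: leave-one-out cross-validated LDA accuracy
# using the first 1..max_pcs principal components.
# `scores`: matrix of PC scores (specimens in rows, PCs in columns)
# `groups`: factor of group membership, one entry per specimen
lda_accuracy_by_pcs <- function(scores, groups, max_pcs = ncol(scores)) {
  groups <- factor(groups)
  acc <- sapply(seq_len(max_pcs), function(k) {
    # CV = TRUE makes lda() return leave-one-out predictions in $class
    fit <- lda(scores[, seq_len(k), drop = FALSE], grouping = groups, CV = TRUE)
    100 * mean(fit$class == groups)
  })
  out <- data.frame(PCs = seq_len(max_pcs), percent_accuracy = round(acc, 1))
  out[order(-out$PCs), ]  # largest number of PCs first, as in the question
}

# Stand-in example data (assumption): PCA of the iris measurements
pca <- prcomp(iris[, 1:4])
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
max_pcs <- which(cum_var >= 0.99)[1]  # user-defined cutoff; adjust as needed

res <- lda_accuracy_by_pcs(pca$x, iris$Species, max_pcs)
print(res)
```

Note that the PCA in this sketch is fitted on all specimens before cross-validation, and choosing the number of PCs by the best cross-validated rate will still give a somewhat optimistic accuracy estimate, which is the concern raised in the first comment.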

0 Answers