
I'm looking into machine learning and am using LDA as a starting point. I'm following MATLAB's own tutorial on LDA classification (Here).

They are using the fisheriris dataset, which is already available in MATLAB and can simply be loaded. In the tutorial they use this line for classification:

ldaClass = classify(meas(:,1:2),meas(:,1:2),species);

I know that:

classify trains a classifier on the training data and labels (second and third arguments), applies it to the test data (first argument), and returns in ldaClass the classes chosen for the test data points, based on the classifier trained on the training data and labels.

So the same data that was given to the classifier for training was used for testing as well?

My understanding of supervised learning is that once a system is trained with a training set, it should be tested with an unknown sample to evaluate its predictions. And if it is given a test sample taken from the training set, it should be able to classify it correctly. Right?

Then how come classify misclassifies 20% of the labels when the same samples are used for both training and testing?

Either LDA is a very poor classifier or I am not understanding the concept here completely. Help me out please.

Amro
StuckInPhDNoMore

2 Answers


The data set is not linearly separable (see the blue and green points in your link). LDA basically divides the data by straight lines in the 2D case; as there is no line that can perfectly separate the blue and green training points, there will always be some misclassifications.

This explains the misclassification rate. In general, a lot of classifiers will have trouble with that kind of data. Because of the overlap of two of the classes, there will either be severe overfitting or some residual error in the training set.
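To make this concrete, here is a minimal toy sketch, written in Python rather than the tutorial's MATLAB, using a nearest-centroid rule (which, like two-class LDA with equal covariances, draws a single linear decision boundary). The data values are hypothetical; the point is that when the classes overlap, even the best single threshold misclassifies some training points:

```python
# Toy 1-D data: the two classes overlap, so no single threshold
# (the 1-D analogue of a separating line) can split them perfectly.
class_a = [1.0, 2.0, 3.0, 6.0]   # class A; the point 6 sits in B's range
class_b = [4.0, 5.0, 7.0, 8.0]   # class B; the point 4 sits in A's range

mean_a = sum(class_a) / len(class_a)   # centroid of A (3.0)
mean_b = sum(class_b) / len(class_b)   # centroid of B (6.0)

def predict(x):
    # Assign x to the nearer centroid; the decision boundary is the
    # midpoint between the two means, i.e. a single linear threshold.
    return "A" if abs(x - mean_a) < abs(x - mean_b) else "B"

train = [(x, "A") for x in class_a] + [(x, "B") for x in class_b]
errors = sum(1 for x, label in train if predict(x) != label)
print(errors / len(train))   # 0.25: two of eight *training* points misclassified
```

No choice of threshold can do better than misclassifying the two overlapping points, so the training error stays above zero no matter how the linear boundary is placed.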

anderas
    I just want to point out that the `classify` function supports both [linear](https://en.wikipedia.org/wiki/Linear_discriminant_analysis) and [quadratic](https://en.wikipedia.org/wiki/Quadratic_classifier) discriminant classifiers (the default being linear). In the case of QDA, the classes are separated by a quadratic surface, not just a straight line or a (hyper-)plane. – Amro Jul 17 '14 at 20:57

You're right that in a real-world situation, best practice is to train a classifier on one sample and evaluate it on another, and also that evaluating the classifier on the training sample gives a biased (over-optimistic) estimate of the classifier's accuracy.

However, you're reading a tutorial, which is attempting to teach you the correct syntax for applying classify, rather than trying to teach you best practices in statistical learning. Note that the tutorial is fairly explicit about this: it emphasises that the error rate it calculates is the resubstitution error rate (i.e. the over-optimistic one measured on the training sample).
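The difference between resubstitution error and error on unseen data can be sketched in a few lines. This is a Python illustration with made-up data and a hypothetical nearest-centroid classifier standing in for LDA, not the tutorial's MATLAB code:

```python
# Hypothetical data: error measured on the training set (resubstitution
# error) underestimates error on data the classifier has never seen.
train = [(1.0, "A"), (2.0, "A"), (3.0, "A"), (7.0, "B"), (8.0, "B"), (9.0, "B")]
test  = [(4.9, "A"), (5.5, "A"), (4.0, "B"), (6.0, "B")]  # held-out sample

mean_a = sum(x for x, y in train if y == "A") / 3   # 2.0
mean_b = sum(x for x, y in train if y == "B") / 3   # 8.0

def predict(x):
    # Nearest-centroid rule, standing in for the trained classifier.
    return "A" if abs(x - mean_a) < abs(x - mean_b) else "B"

def error_rate(data):
    return sum(predict(x) != y for x, y in data) / len(data)

print(error_rate(train))  # 0.0 -> resubstitution error, over-optimistic
print(error_rate(test))   # 0.5 -> error on the held-out sample
```

The classifier looks perfect on its own training data but does much worse on the held-out points, which is exactly why the resubstitution rate should not be reported as the classifier's accuracy.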

But you're wrong to assume that a classifier evaluated on the sample it was trained on will classify every point correctly; that's not at all true. In this case, two of the classes overlap significantly, the classifier cannot completely separate them, and that is what gives rise to the 20% error.

That doesn't mean that LDA is a poor classifier; it means it is a simple model that is unable to find twists and turns that would completely separate the two overlapping classes. Simple models are bad when the data has complex relationships; they are good when the relationships are simple, and also when the relationships are complex but the data is noisy enough that a complex model would fit to the noise rather than the complex relationship.

Sam Roberts