
I am extracting image features from 10 classes with 1000 images each. Since there are 50 features I can extract, I would like to find the best feature combination to use. The training, validation, and test sets are divided as follows:

Training set = 70%
Validation set = 15%
Test set = 15%

I use forward feature selection on the validation set to find the best feature combination and finally use the test set to check the overall accuracy. Could someone please tell me whether I am doing it right?
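For concreteness, here is a minimal sketch of this workflow, assuming scikit-learn and a kNN classifier (as in the answers below); `X` (an n_samples × 50 feature matrix) and `y` (the class labels) are hypothetical placeholders:

```python
# Minimal sketch of the 70/15/15 split + greedy forward selection workflow.
# X (n_samples x 50 feature matrix) and y (class labels) are placeholders.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 70% train, then split the remaining 30% evenly into validation and test
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=0)

def val_accuracy(feats):
    """Validation accuracy of kNN restricted to the given feature columns."""
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_train[:, feats], y_train)
    return accuracy_score(y_val, clf.predict(X_val[:, feats]))

# Greedy forward selection: repeatedly add the feature that most improves
# validation accuracy, stopping when no remaining feature helps
selected, remaining, best = [], list(range(X.shape[1])), 0.0
while remaining:
    scores = {f: val_accuracy(selected + [f]) for f in remaining}
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best:
        break
    best = scores[f_best]
    selected.append(f_best)
    remaining.remove(f_best)

# The test set is touched exactly once, to report the final accuracy
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train[:, selected], y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test[:, selected])))
```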

Gastón Bengolea
klijo

2 Answers


kNN is an exception to the general workflow for building and testing supervised machine learning models. In particular, the model created via kNN is just the available labeled data, placed in some metric space.

In other words, for kNN there is no training step, because there is no model to build. Template matching and interpolation are all that is going on in kNN.

Neither is there a validation step. Validation measures model accuracy on held-out data as a function of training progress (e.g., iteration count). Overfitting is evidenced by the upward turn of this empirical error curve, which indicates the point at which training should cease. Again, because no model is built, there is nothing to validate.

But you can still test, i.e., assess the quality of the predictions on data whose targets (labels or scores) are withheld from the model.

But even testing is a little different for kNN than for other supervised machine learning techniques. In particular, for kNN the quality of predictions depends on the amount of data, or more precisely on its density (number of points per unit volume): if you are going to predict unknown values by averaging the 2-3 points closest to them, it helps to have points close to the one you wish to predict. Therefore, keep the size of the test set small, or better yet use k-fold cross-validation or leave-one-out cross-validation, both of which give you more thorough model testing without reducing the size of your kNN neighbor population.
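For example, a minimal sketch of both schemes, assuming scikit-learn and hypothetical placeholder arrays `X` (features) and `y` (labels):

```python
# Minimal sketch: k-fold and leave-one-out evaluation for kNN.
# X and y are placeholders for the feature matrix and labels.
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=3)

# 10-fold CV: every point is tested exactly once, and each fold still
# keeps 90% of the data as the neighbor population
kfold_acc = cross_val_score(clf, X, y, cv=10).mean()

# Leave-one-out: the neighbor population is all but a single point
loo_acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

print(f"10-fold accuracy: {kfold_acc:.3f}, LOO accuracy: {loo_acc:.3f}")
```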

doug
  • But since I need to find the best feature combination, shouldn't I perform this search on the validation set and then finally test the selected best features on the test set? – klijo May 30 '12 at 13:43
  • If I run the feature selection algorithm on the test set and then report the final accuracy, wouldn't this make the feature combination biased towards the test set? – klijo May 30 '12 at 14:02
  • @klijo The canonical kNN description does not include an algorithm for feature selection, or anything of the sort. Apart from that, I don't understand the question in either of your comments, but I'm certain they have nothing to do with kNN. – doug May 30 '12 at 20:25
  • I think this answer fundamentally misses the concept of model validation in a way that is quite common. Probably the best way to think about it is in terms of the information space: if you narrow the information space to provide evidence for your predicted outcome, you need to make sure it generalizes to **unseen** data. "Unseen data" means data that was not used to narrow the information space. That is true whether there is a model or not. – Renel Chesak May 21 '20 at 10:13
  • The test set should never be used to **select** (to narrow the information space: select features, tune parameters, or participate in any way in the process of training / model construction); **that is what the validation set is for**. If you used your test set for feature selection, parameter tuning (`k` in kNN), and/or model selection, that data is no longer "unseen"; it was used to select, to narrow the information space. **The test set should only ever be used to report the final metrics on your chosen model.** – Renel Chesak May 21 '20 at 10:23
  • Suppose you trained/validated three different models. Then you ran each of those models on the same test set. Then you chose the model with the best test-set score. This act of choosing a model is an act of biasing the results, isn't it? At the end of the day, it seems like you can't really produce unbiased results unless you have yet ANOTHER hold-out dataset; in this case, some type of four-way split. – user2253546 May 20 '21 at 22:07
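One standard remedy for the bias described in the last comment is nested cross-validation: an inner loop does the selecting (features, `k`, or the model itself) and an outer loop reports the score, so the selection never sees the outer test folds. A minimal sketch, assuming scikit-learn and placeholder arrays `X` and `y`:

```python
# Minimal sketch: nested cross-validation for kNN hyperparameter selection.
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Inner loop: choose k on 5 inner folds; outer loop: score the whole procedure
inner = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
nested_acc = cross_val_score(inner, X, y, cv=5).mean()
print("nested CV accuracy:", nested_acc)
```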

kNN is not trained. All of the data is kept and used at run time for prediction, so it is one of the most time- and space-consuming classification methods. Feature reduction can mitigate these costs. Cross-validation is a much better way of testing than a single train/test split.
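For instance, a minimal sketch, assuming scikit-learn and placeholder arrays `X` and `y`, of reducing the feature space before kNN and scoring the whole pipeline with cross-validation:

```python
# Minimal sketch: feature reduction (PCA) + kNN, scored by cross-validation.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

pipe = make_pipeline(
    StandardScaler(),                    # kNN is distance-based, so scale first
    PCA(n_components=10),                # e.g., reduce 50 features to 10 components
    KNeighborsClassifier(n_neighbors=5),
)
print("CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean())
```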

Yasir Khan