
I was trying to classify a dataset using the following strategy:

  • Leave-one-out cross-validation (LOO)
  • KNN to classify (count the number of errors) for each "fold"
  • Calculate the final error rate
  • Repeat for k = [1,2,3,4,5,7,10,12,15,20]

Here's the code, for the fisheriris dataset:

load fisheriris
cur=meas;true_label=species;

for norm=0:2
    feats=normalizamos(cur,norm); % a function I use on my dataset for
                                  % normalization: norm=0 means no normalization;
                                  % norm=1 and norm=2 are two different normalizations

    c=cvpartition(size(feats,1),'leaveout');

    for k=[1,2,3,4,5,7,10,12,15,20]

        clear n_erros % reset the per-fold error counts for this k
        for i=1:c.NumTestSets
            tr=c.training(i);te=c.test(i);

            train_set=feats(tr,:);
            test_set=feats(te,:);

            train_class=true_label(tr);
            test_class=true_label(te);

            pred=knnclassify(test_set,train_set,train_class,k);
            n_erros(i)=sum(~strcmp(pred,test_class));
        end

        err_rate=sum(n_erros)/sum(c.TestSize) % no semicolon: display the LOO error rate for this k
    end
end
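
As an aside: knnclassify has since been deprecated in MATLAB in favor of fitcknn. On newer releases the same LOO experiment can be written with fitcknn and crossval. A minimal sketch, assuming the Statistics and Machine Learning Toolbox is available and that it runs inside the k loop above:

mdl=fitcknn(feats,true_label,'NumNeighbors',k); % k-NN model on the full data
cvmdl=crossval(mdl,'Leaveout','on');            % leave-one-out cross-validation
err_rate=kfoldLoss(cvmdl)                       % LOO misclassification rate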

Since the results (for my dataset) looked incoherent, I decided to write my own version of LOO, as follows:

for i=1:size(cur,1)

    % hold out sample i as the test set; train on all the others
    test_set=feats(i,:);
    test_class=true_label(i);

    if i==1
        train_set=feats(i+1:end,:);
        train_class=true_label(i+1:end);
    else
        train_set=[feats(1:i-1,:);feats(i+1:end,:)];
        train_class=[true_label(1:i-1);true_label(i+1:end)];
    end

    pred=knnclassify(test_set,train_set,train_class,k);
    n_erros(i)=sum(~strcmp(pred,test_class));
end
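
For completeness, the final error rate of this manual version is computed with the same division as before; since each sample is held out exactly once, the denominator is simply the number of samples (this sketch mirrors the err_rate line of the MATLAB version above):

err_rate=sum(n_erros)/size(cur,1) % one test per sample, so divide by the sample count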

Assuming my version of the code is well written, I was hoping for the same, or at least similar, results. Here are both outcomes:

[Results: screenshot comparing the outcomes of both versions]

Any idea why the results are so different? Which version should I use? Now I'm thinking of rewriting the other tests I did (3-fold, 5-fold, etc.) just to be sure.

Thank you all

  • MATLAB's functions are pretty solid. I don't know enough about the topic to judge, but if you say that the results of MATLAB's method contradict your own, there are two likely suspects: first, a mistake in your own method; second, a mistake in how you use the MATLAB method. ---- In order to assist you, it would help if you could strip the code to the bare minimum (of what you need to demonstrate the problem) and include some sample input for which the problem occurs. – Dennis Jaheruddin Oct 08 '14 at 12:35
  • In the MATLAB case, it randomly takes one sample as the test set and all the others as the training set, and repeats the process until every sample has been used as a test set. In MY case the only difference is that the test sample is selected in order rather than randomly, but all the samples are still used and the result is the mean over all the tests, so the outcome should be the same (see the verification sketch after these comments)... The problem isn't in my dataset, because the results I showed are for fisheriris. I only noticed the incoherence because the results for my dataset didn't make sense in the problem's context – Pedro Álvaro Chagas Oct 08 '14 at 12:46
  • Have you tried multiplying your results by 4? Just kidding (but not really). Is there any scaling somewhere that could lead to all values being off by a factor of 4? – Stewie Griffin Oct 08 '14 at 12:49
  • I didn't realize fisheriris was included in MATLAB by default, my mistake. However, given @RobertP.'s observation, it should not be very hard to figure out where the difference comes from. – Dennis Jaheruddin Oct 08 '14 at 13:02
  • Haha, you were actually right: 4 is the number of features in fisheriris. In my dataset the results always appear scaled by the number of features, so the problem is just a matter of division at the end, when calculating the error! Thank you all, and sorry for this silly question! – Pedro Álvaro Chagas Oct 08 '14 at 13:31
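
The claim in the second comment above, that cvpartition's 'leaveout' option uses every sample as a test point exactly once (only in a random order), can be checked directly. A small verification sketch:

c=cvpartition(size(meas,1),'leaveout');
counts=zeros(size(meas,1),1);
for i=1:c.NumTestSets
    counts=counts+c.test(i); % count how often each sample lands in a test set
end
all(counts==1)               % returns true: each sample is held out exactly once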
