
I'm training and cross-validating (10-fold) a classifier with libSVM, using a linear kernel.

Each data point consists of 1800 fMRI voxel intensities (one intensity per feature). There are around 88 data points in the training-set file passed to svm-train.

The training-set file looks as follows:

+1 1:0.9 2:-0.2 ... 1800:0.1

-1 1:0.6 2:0.9 ... 1800:-0.98

...

I should also mention that I'm using the svm-train tool that comes with the libSVM package.
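Roughly, the command I'm running looks like this (train_file is a placeholder for my training-set file):

./svm-train -t 0 -v 10 train_file

Here -t 0 selects the linear kernel and -v 10 requests the 10-fold cross-validation.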

The problem is that when I run svm-train, it reports 100% accuracy!

This doesn't seem to reflect the true classification performance! The data isn't unbalanced, since

#datapoints labeled +1 == #datapoints labeled -1

I've also checked the scaling (it scales correctly), and I also tried changing the labels randomly to see how that impacts the accuracy - it only drops from 100% to 97.9%.
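For reference, scaling with the svm-scale tool from the same package would look roughly like this (file names are placeholders; -s stores the scaling parameters so the same ranges can be reused on new data):

./svm-scale -l -1 -u 1 -s range_file train_file > train_file.scaled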

Could you please help me understand what the problem is, and what I can do to fix it?

Thanks,

Gal Star

  • I don't think there is a problem. Your SVM can easily give a 100% fit on the training set; that is perfectly fine. This is called overfitting (http://en.wikipedia.org/wiki/Overfitting). I think you need to read up on in-sample and out-of-sample training. – sashkello Jan 27 '14 at 23:13
  • This question appears to be off-topic because it is about machine learning. – sashkello Jan 27 '14 at 23:14
  • How can I read up on in-sample and out-of-sample training? – gal.star Jan 27 '14 at 23:16
  • I mean read some literature on this topic :) This is too large of a problem to outline as an answer, there is a lot of research about proper training and cross-validation. If you don't know what it means, this is what you need to know before doing any coding... – sashkello Jan 27 '14 at 23:17
  • Hi, so basically you think I should get better results if I reduce the number of voxel intensities from 1800 to a smaller amount, maybe by choosing the correct representative voxels? – gal.star Jan 27 '14 at 23:23
  • I do know what training and cross-validation mean :) I'll try to see whether I can choose the best voxels to eliminate the overfitting problem - thank you for your help. – gal.star Jan 27 '14 at 23:37
  • Since you are using a linear kernel, a 100% result means that your training set is perfectly linearly separable. It may be that your training set is too small. Hand-picking samples will not make the situation better, only worse. What is your out-of-sample accuracy? – sashkello Jan 27 '14 at 23:44

1 Answer


Make sure you include '-v 10' in the svm-train options. I'm not sure whether your 100% accuracy comes from the training samples or the validation samples. It is very possible to get 100% training accuracy, since you have far fewer samples than features. But if your model suffers from overfitting, the cross-validation accuracy may be low.
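For example (file names are placeholders), the two runs below report different things. The first performs 10-fold cross-validation and prints a line like "Cross Validation Accuracy = ...%" without writing a model file; the second trains on the whole file and then predicts on that same file, which gives the in-sample (training) accuracy. With roughly 88 samples and 1800 features, the latter can easily be 100% even if the model generalizes poorly.

# 10-fold cross-validation with a linear kernel; no model file is produced
./svm-train -t 0 -v 10 train_file

# train on all the data, then predict on that same data: this is the optimistic in-sample accuracy
./svm-train -t 0 train_file model_file
./svm-predict train_file model_file predictions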

  • Thank you for answering :) I have used the -v 10 option. Overfitting could be the problem. Though, should it be causing such high results? – gal.star Jan 27 '14 at 23:26
  • It's possible. I would suggest you shrink your region of interest to reduce the number of voxels (features), and then observe the cross-validation result again. – lennon310 Jan 28 '14 at 00:21