
I'm using libsvm in Matlab to classify a dataset with 5 classes. The data here is 2-D, but I don't think that matters.

The amount of testing data for each class is balanced. For the training data, when I use 5 training samples per class, the classification result is good. However, when I increase the number of training samples for one class (say class 2) from 5 to 10, the classification accuracy becomes poor, especially for class 2.

The code I use is very simple:

model = svmtrain2(trainLabels, trainData);
[LabelSVM] = svmpredict2(testLabels, testData, model);

Is this because there are options in svmtrain2 that I should be specifying, or is it caused by something else? Thank you.

Z Cao
  • See if [this question](http://stackoverflow.com/questions/18078084/how-should-i-teach-machine-learning-algorithm-using-data-with-big-disproportion/18088148#18088148) helps. Basically there are methods for dealing with imbalances in your data set in scikit-learn (built on libsvm), but none it seems directly available through libsvm. You could roll your own, but the scikit-learn options seem to work well. – Engineero Jul 17 '14 at 22:39
  • Can you give a 2-D plot of the data? Maybe we can extract some useful information about how you can solve your problem. – Darkmoor Jul 21 '14 at 20:48
  • Hi, I uploaded the data; it is nothing special. The circled points are the training ones. I still haven't figured out why a small imbalance in the training data causes such a serious problem. – Z Cao Jul 24 '14 at 19:05
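A minimal sketch of the class-weighting idea from Engineero's comment, in Python with scikit-learn's libsvm-backed SVC (the equivalent in LIBSVM itself is the `-wi` training option, e.g. `-w2 0.5` to halve class 2's cost). The toy data and parameter values below are made up for illustration, not the asker's setup:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy 2-D data: three well-separated classes, with class 2 over-represented 2:1.
X = np.vstack([rng.normal([0, 0], 0.3, (5, 2)),
               rng.normal([3, 0], 0.3, (10, 2)),
               rng.normal([0, 3], 0.3, (5, 2))])
y = np.array([1] * 5 + [2] * 10 + [3] * 5)

# class_weight='balanced' rescales each class's C by
# n_samples / (n_classes * n_class_samples), which counteracts the
# over-represented class -- the same effect as LIBSVM's '-wi' option.
model = SVC(kernel='rbf', C=1.0, gamma='scale', class_weight='balanced')
model.fit(X, y)
print(model.score(X, y))  # training accuracy on the toy set
```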

1 Answer


Take a look at this SVM guide from LIBSVM. It's a pretty good introduction - for an even quicker fix, see section 1.2 (though you're better off reading the whole thing if you haven't already).

Basically, make sure you've scaled your data (using the same scaling factors for both the training and testing sets) and that you've tuned the parameters, which for the default RBF kernel are the cost C and γ.
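Those two steps can be sketched in Python as follows (toy 2-D data and an assumed parameter grid, for illustration only; in LIBSVM the corresponding options are `-c` and `-g`):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
# Toy stand-in for the asker's setup: 5 classes in 2-D, 5 training points each.
centers = np.array([[0, 0], [6, 0], [0, 6], [6, 6], [3, 3]], dtype=float)
trainData = np.vstack([rng.normal(c, 0.4, (5, 2)) for c in centers])
trainLabels = np.repeat(np.arange(1, 6), 5)
testData = np.vstack([rng.normal(c, 0.4, (10, 2)) for c in centers])
testLabels = np.repeat(np.arange(1, 6), 10)

# Scale using statistics of the TRAINING set only, then apply the same
# transform to the test set -- never scale the two sets independently.
mu, sigma = trainData.mean(axis=0), trainData.std(axis=0)
trainScaled = (trainData - mu) / sigma
testScaled = (testData - mu) / sigma

# Cross-validated grid search over C and gamma (LIBSVM's '-c'/'-g').
grid = GridSearchCV(SVC(kernel='rbf'),
                    {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1, 10]},
                    cv=5)
grid.fit(trainScaled, trainLabels)
print(grid.best_params_, grid.score(testScaled, testLabels))
```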

I also think that with only 5 data points per class, you won't get very reliable performance: it's quite easy for the SVM to over-fit that little data.
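One quick way to see the instability (a Python sketch with made-up 2-D blobs): train on several different 5-points-per-class draws and look at the spread in test accuracy across draws:

```python
import numpy as np
from sklearn.svm import SVC

centers = np.array([[0, 0], [3, 0], [0, 3]], dtype=float)
accs = []
for seed in range(10):
    rng = np.random.default_rng(seed)
    # 5 training points per class, drawn fresh each iteration.
    Xtr = np.vstack([rng.normal(c, 1.0, (5, 2)) for c in centers])
    ytr = np.repeat([1, 2, 3], 5)
    # A larger held-out test set from the same distribution.
    Xte = np.vstack([rng.normal(c, 1.0, (50, 2)) for c in centers])
    yte = np.repeat([1, 2, 3], 50)
    accs.append(SVC(kernel='rbf', gamma='scale').fit(Xtr, ytr).score(Xte, yte))

print(min(accs), max(accs))  # the spread shows how much the tiny sample matters
```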

Sveltely
  • Hi, in fact my data is quite easy to classify (see picture). Even with fewer than 5 training samples per class, LDA easily gets above 95% overall accuracy. Also, it seems to me that scaling may not be the problem here... – Z Cao Jul 24 '14 at 19:09