I am trying to train an SVM on banner-ad data, where each impression is labeled "view", "click", or "conversion". The main problem is that clicks make up only about 0.2% of all the data, so the classes are heavily imbalanced. When I use a plain SVM, in the testing phase it always predicts the "view" class and never "click" or "conversion". On average it gives 99.8% right answers (because of the imbalance), but it gives 0% right predictions if you check the "click" or "conversion" cases. How can I tune the SVM algorithm (or select another one) to take the imbalance into consideration?
-
Is up-sampling the minority class an option? – Thomas Jungblut Aug 06 '13 at 18:33
-
Could you say more about what you mean by up-sampling? – rvnikita Aug 06 '13 at 21:22
-
possible duplicate of [sklearn logistic regression with unbalanced classes](http://stackoverflow.com/questions/14863125/sklearn-logistic-regression-with-unbalanced-classes) – Fred Foo Aug 07 '13 at 07:31
2 Answers
The most basic approach here is to use a so-called "class weighting scheme". In the classical SVM formulation there is a `C` parameter that controls the misclassification penalty. It can be split into parameters `C1` and `C2`, used for class 1 and class 2 respectively. The most common choice of `C1` and `C2` for a given `C` is to put

C1 = C / n1
C2 = C / n2

where `n1` and `n2` are the sizes of class 1 and class 2 respectively. This way you "punish" the SVM much harder for misclassifying the less frequent class than for misclassifying the more common one.
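A minimal sketch of that formula in scikit-learn (the class sizes here are made up for illustration): `class_weight` multiplies `C` per class, so passing `1/n_k` for each class yields the effective `C_k = C / n_k` described above.

```python
import numpy as np
from sklearn import svm

# hypothetical imbalanced data: 1000 samples in class 0, 50 in class 1
rng = np.random.RandomState(0)
y = np.array([0] * 1000 + [1] * 50)
X = rng.randn(len(y), 2)

# class_weight multiplies C per class, giving C1 = C/n1 and C2 = C/n2
n1, n2 = np.bincount(y)
clf = svm.SVC(kernel='linear', C=1.0,
              class_weight={0: 1.0 / n1, 1: 1.0 / n2})
clf.fit(X, y)
```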
Many existing libraries (like libSVM) support this mechanism through a class_weight parameter.
Example using Python and scikit-learn:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

# create two clusters: 1000 majority-class points and 100 minority-class points
rng = np.random.RandomState(0)
n_samples_1 = 1000
n_samples_2 = 100
X = np.r_[1.5 * rng.randn(n_samples_1, 2),
          0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
y = [0] * n_samples_1 + [1] * n_samples_2

# fit the model and get the separating hyperplane
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - clf.intercept_[0] / w[1]

# get the separating hyperplane using weighted classes
wclf = svm.SVC(kernel='linear', class_weight={1: 10})
wclf.fit(X, y)

ww = wclf.coef_[0]
wa = -ww[0] / ww[1]
wyy = wa * xx - wclf.intercept_[0] / ww[1]

# plot both separating hyperplanes and the samples
plt.plot(xx, yy, 'k-', label='no weights')
plt.plot(xx, wyy, 'k--', label='with weights')
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.legend()
plt.axis('tight')
plt.show()
In particular, in sklearn you can simply turn on automatic weighting by setting class_weight='balanced' (older scikit-learn versions called this option 'auto').
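For example, a minimal sketch of the automatic option (using the current 'balanced' spelling):

```python
from sklearn import svm

# 'balanced' reweights each class inversely to its frequency,
# so rare classes like "click" are not drowned out by "view"
clf = svm.SVC(kernel='linear', class_weight='balanced')
```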
-
Thank you very much, this is exactly what I was looking for. I wish I had 15 reputation points to upvote this answer :) – rvnikita Aug 06 '13 at 21:20
-
I am pretty sure that you can still check the "accept answer" option :) – lejlot Aug 07 '13 at 14:26
This paper describes a variety of techniques. One simple approach (though a very bad one for SVMs) is just replicating the minority class(es) until you have a balance:
http://www.ele.uri.edu/faculty/he/PDFfiles/ImbalancedLearning.pdf
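For completeness, a minimal sketch of that replication idea (the function name and signature are made up for illustration; as the comment below notes, this is a poor fit for SVMs):

```python
import numpy as np

def upsample_minority(X, y, minority_label=1, random_state=0):
    """Replicate minority-class rows (with replacement) until classes balance.

    X and y are NumPy arrays; y is a binary label vector.
    """
    rng = np.random.RandomState(random_state)
    minority_idx = np.where(y == minority_label)[0]
    majority_idx = np.where(y != minority_label)[0]
    # sample minority indices with replacement up to the majority count
    extra = rng.choice(minority_idx, size=len(majority_idx), replace=True)
    idx = np.concatenate([majority_idx, extra])
    return X[idx], y[idx]

# toy usage: 1000 majority vs 100 minority samples
X = np.random.RandomState(0).randn(1100, 2)
y = np.array([0] * 1000 + [1] * 100)
X_bal, y_bal = upsample_minority(X, y)  # now ~1000 samples per class
```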
-
Just for completeness - replicating the minority class should **never** be used with an SVM. It is equivalent to using class weights, while at the same time being completely inefficient in terms of training (and testing) time. – lejlot Oct 07 '15 at 08:52