I am trying to train an SVM on banner-ad data, where each impression is labeled "view", "click", or "conversion". The main problem is that clicks make up only about 0.2% of all the data, so the classes are heavily imbalanced. When I use a plain SVM, in the testing phase it always predicts the "view" class and never "click" or "conversion". On average it gives 99.8% right answers (because of the imbalance), but it gives 0% right predictions if you check the "click" or "conversion" cases. How can I tune the SVM algorithm (or select another one) to take the imbalance into consideration?
-
Is up-sampling the minority class an option? – Thomas Jungblut Aug 06 '13 at 18:33
-
Could you say more about what you mean by up-sampling? – rvnikita Aug 06 '13 at 21:22
-
possible duplicate of [sklearn logistic regression with unbalanced classes](http://stackoverflow.com/questions/14863125/sklearn-logistic-regression-with-unbalanced-classes) – Fred Foo Aug 07 '13 at 07:31
2 Answers
The most basic approach here is to use a so-called "class weighting scheme". In the classical SVM formulation there is a `C` parameter that controls the misclassification penalty. It can be split into parameters `C1` and `C2`, used for class 1 and class 2 respectively. The most common choice of `C1` and `C2` for a given `C` is to put

C1 = C / n1
C2 = C / n2

where `n1` and `n2` are the sizes of class 1 and class 2 respectively. This way you "punish" the SVM much harder for misclassifying the less frequent class than for misclassifying the more common one.
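A minimal sketch of that formula in scikit-learn (the class sizes here are made up for illustration): `class_weight` multiplies `C` per class, so passing `1/n_k` for each class yields the effective `C_k = C / n_k` described above.

```python
import numpy as np
from sklearn import svm

# hypothetical imbalanced data: 1000 samples in class 0, 50 in class 1
rng = np.random.RandomState(0)
y = np.array([0] * 1000 + [1] * 50)
X = rng.randn(len(y), 2)

# class_weight multiplies C per class, giving C1 = C/n1 and C2 = C/n2
n1, n2 = np.bincount(y)
clf = svm.SVC(kernel='linear', C=1.0,
              class_weight={0: 1.0 / n1, 1: 1.0 / n2})
clf.fit(X, y)
```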
Many existing libraries (like libSVM) support this mechanism through a class_weight parameter.
Example using Python and scikit-learn:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

# create two clusters: 1000 majority-class points and 100 minority-class points
rng = np.random.RandomState(0)
n_samples_1 = 1000
n_samples_2 = 100
X = np.r_[1.5 * rng.randn(n_samples_1, 2),
          0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
y = [0] * n_samples_1 + [1] * n_samples_2

# fit the model and get the separating hyperplane
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - clf.intercept_[0] / w[1]

# get the separating hyperplane using weighted classes
wclf = svm.SVC(kernel='linear', class_weight={1: 10})
wclf.fit(X, y)

ww = wclf.coef_[0]
wa = -ww[0] / ww[1]
wyy = wa * xx - wclf.intercept_[0] / ww[1]

# plot both separating hyperplanes and the samples
plt.plot(xx, yy, 'k-', label='no weights')
plt.plot(xx, wyy, 'k--', label='with weights')
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.legend()
plt.axis('tight')
plt.show()
In particular, in sklearn you can simply turn on automatic weighting by setting class_weight='balanced' (older scikit-learn versions called this option 'auto').
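For example, a minimal sketch of the automatic option (using the current 'balanced' spelling):

```python
from sklearn import svm

# 'balanced' reweights each class inversely to its frequency,
# so rare classes like "click" are not drowned out by "view"
clf = svm.SVC(kernel='linear', class_weight='balanced')
```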
-
Thank you very much, this is exactly what I was looking for. I wish I had 15 reputation points to upvote this answer :) – rvnikita Aug 06 '13 at 21:20
-
I am pretty sure that you can still check the "accept answer" option :) – lejlot Aug 07 '13 at 14:26
This paper describes a variety of techniques. One simple approach (though a very bad one for SVMs) is just replicating the minority class(es) until you have a balance:
http://www.ele.uri.edu/faculty/he/PDFfiles/ImbalancedLearning.pdf
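For completeness, a minimal sketch of that replication idea (the function name and signature are made up for illustration; as the comment below notes, this is a poor fit for SVMs):

```python
import numpy as np

def upsample_minority(X, y, minority_label=1, random_state=0):
    """Replicate minority-class rows (with replacement) until classes balance.

    X and y are NumPy arrays; y is a binary label vector.
    """
    rng = np.random.RandomState(random_state)
    minority_idx = np.where(y == minority_label)[0]
    majority_idx = np.where(y != minority_label)[0]
    # sample minority indices with replacement up to the majority count
    extra = rng.choice(minority_idx, size=len(majority_idx), replace=True)
    idx = np.concatenate([majority_idx, extra])
    return X[idx], y[idx]

# toy usage: 1000 majority vs 100 minority samples
X = np.random.RandomState(0).randn(1100, 2)
y = np.array([0] * 1000 + [1] * 100)
X_bal, y_bal = upsample_minority(X, y)  # now ~1000 samples per class
```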
-
Just for completeness - replicating the minority class should **never** be used with an SVM. It is equivalent to using class weights, while at the same time being completely inefficient in terms of training (and testing) time. – lejlot Oct 07 '15 at 08:52