10

I am an SVM newbie and this is my use case: I have a lot of unbalanced data to be binary classified using a linear SVM. I need to fix the false positives rate at certain values and measure the corresponding false negatives for each value. I am using something like the following code making use of scikit-learn svm implementation:

# define training data
X = [[0, 0], [1, 1]]
y = [0, 1]

# define and train the SVM
clf = svm.LinearSVC(C=0.01, class_weight='auto') #auto for unbalanced distributions
clf.fit(X, y)

# compute false positives and false negatives
predictions = [clf.predict(ex) for ex in X]    
false_positives = [(a, b) for (a, b) in zip(predictions,y) if a != b and b == 0]
false_negatives = [(a, b) for (a, b) in zip(predictions,y) if a != b and b == 1] 

Is there a way to play with a parameter (or a few parameters) of the classifier such that one the measurement metrics is effectively fixed?

rae1
  • 6,066
  • 4
  • 27
  • 48
Rabih Kodeih
  • 9,361
  • 11
  • 47
  • 55

2 Answers2

18

The class_weights parameter allows you to push this false positive rate up or down. Let me use an everyday example to illustrate how this work. Suppose you own a night club, and you operate under two constraints:

  1. You want as many people as possible to enter the club (paying customers)
  2. You do not want any underage people in, as this will get you in trouble with the state

On an average day, (say) only 5% percent of the people attempting to enter the club will be underage. You are faced with a choice: being lenient or being strict. The former will boost your profits by as much as 5%, but you are running the risk of an expensive lawsuit. The latter will inevitably mean some people who are just above the legal age will be denied entry, which will cost you money too. You want to adjust the relative cost of leniency vs strictness. Note: you cannot directly control how many underage people enter the club, but you can control how strict your bouncers are.

Here is a bit of Python that shows what happens as you change the relative importance.

from collections import Counter
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

data = load_iris()

# remove a feature to make the problem harder
# remove the third class for simplicity
X = data.data[:100, 0:1] 
y = data.target[:100] 
# shuffle data
indices = np.arange(y.shape[0])
np.random.shuffle(indices)
X = X[indices, :]
y = y[indices]

for i in range(1, 20):
    clf = LinearSVC(class_weight={0: 1, 1: i})
    clf = clf.fit(X[:50, :], y[:50])
    print i, Counter(clf.predict(X[50:]))
    # print clf.decision_function(X[50:])

Which outputs

1 Counter({1: 22, 0: 28})
2 Counter({1: 31, 0: 19})
3 Counter({1: 39, 0: 11})
4 Counter({1: 43, 0: 7})
5 Counter({1: 43, 0: 7})
6 Counter({1: 44, 0: 6})
7 Counter({1: 44, 0: 6})
8 Counter({1: 44, 0: 6})
9 Counter({1: 47, 0: 3})
10 Counter({1: 47, 0: 3})
11 Counter({1: 47, 0: 3})
12 Counter({1: 47, 0: 3})
13 Counter({1: 47, 0: 3})
14 Counter({1: 47, 0: 3})
15 Counter({1: 47, 0: 3})
16 Counter({1: 47, 0: 3})
17 Counter({1: 48, 0: 2})
18 Counter({1: 48, 0: 2})
19 Counter({1: 48, 0: 2})

Note how the number of data points classified as 0 decreases are the relative weight of class 1 increases. Assuming you have the computational resources and time to train and evaluate 10 classifiers, you can plot the precision and recall of each one and get a figure like the one below (shamelessly stolen off the internet). You can then use that to decide what the right value of class_weights is for your use case.

Precision-recall tradeoff

mbatchkarov
  • 15,487
  • 9
  • 60
  • 79
  • 1
    @mbatchkarov you can always change the decision boundary of a classifier. – Matti Lyra Jan 29 '14 at 11:10
  • 1
    That's a way I know of, not entirely sure if it's the **standard** way. @MattiLyra describes the other thing you could do, which is faster as you train just one model. – mbatchkarov Jan 29 '14 at 11:39
10

The predict method for LinearSVC in sklearn looks like this

def predict(self, X):
    """Predict class labels for samples in X.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = [n_samples, n_features]
        Samples.

    Returns
    -------
    C : array, shape = [n_samples]
        Predicted class label per sample.
    """
    scores = self.decision_function(X)
    if len(scores.shape) == 1:
        indices = (scores > 0).astype(np.int)
    else:
        indices = scores.argmax(axis=1)
    return self.classes_[indices]

So in addition to what mbatchkarov suggested you can change the decisions made by the classifier (any classifier really) by changing the boundary at which the classifier says something is of one class or the other.

from collections import Counter
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

data = load_iris()

# remove a feature to make the problem harder
# remove the third class for simplicity
X = data.data[:100, 0:1] 
y = data.target[:100] 
# shuffle data
indices = np.arange(y.shape[0])
np.random.shuffle(indices)
X = X[indices, :]
y = y[indices]

decision_boundary = 0
print Counter((clf.decision_function(X[50:]) > decision_boundary).astype(np.int8))
Counter({1: 27, 0: 23})

decision_boundary = 0.5
print Counter((clf.decision_function(X[50:]) > decision_boundary).astype(np.int8))
Counter({0: 39, 1: 11})

You can optimize the decision boundary to be anything depending on your needs.

Matti Lyra
  • 12,828
  • 8
  • 49
  • 67