I am trying to train a support vector machine from scikit-learn, but I don't seem to be getting meaningful results, and I'm wondering if any SVM or scikit-learn experts might know why. Here is the example I am running. I have some handwritten digit data and I want to train a classifier to distinguish 'a' from 'b'. The data I used is here so you can test it out too. Both the training file and the test file are in that archive. Any help understanding the results (the SVM says everything is an 'a') would be greatly appreciated.

Here is my script:

#!/usr/bin/env python

import os
import re
from sklearn import svm

def get_record(line):
    match = re.search(r"^(\S+) (\d+)", line)
    label = match.group(1)
    vector = list(match.group(2))
    vector = [int(x) for x in vector]
    return label, vector

def train_classifier():
    classifier = svm.SVC()
    data = open("sd19-train-binary.txt", "r")
    labels = []
    training_data = [] 
    i = 0
    for line in data:
        label, vector = get_record(line) 
        if label == 'a' or label == 'b': 
            labels.append(label)
            training_data.append(vector)
            i += 1
            if i > 100:
                break
    classifier.fit(training_data, labels) 
    return classifier

def test_classifier(classifier):
    data = open("sd19-test-binary.txt", "r")
    i = 0
    for line in data:
        label, vector = get_record(line)
        if label == 'a' or label == 'b':
            # predict() expects a 2D array: one row per sample
            print(label, classifier.predict([vector])[0])
            i += 1
            if i > 100:
                break

def main():
    classifier = train_classifier()
    test_classifier(classifier)


main()
David Williams

1 Answer

By default, SVC uses an RBF kernel. Unless you set gamma and C, ideally by cross-validating them, you cannot expect meaningful results.
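As a rough illustration of cross-validating gamma and C, here is a minimal sketch using scikit-learn's GridSearchCV. The data is a synthetic stand-in for binary pixel vectors (an assumption, not the sd19 files from the question), and the grid values are only a plausible starting point, not tuned recommendations.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the binary pixel vectors (assumption: NOT the real sd19 data).
rng = np.random.RandomState(0)
n_per_class, n_pixels = 50, 64
proto_a = rng.randint(0, 2, n_pixels)  # hypothetical "a" template
proto_b = rng.randint(0, 2, n_pixels)  # hypothetical "b" template

def noisy_copies(proto, n):
    # Flip each pixel of the template with 10% probability.
    flips = rng.rand(n, proto.size) < 0.1
    return np.where(flips, 1 - proto, proto)

X = np.vstack([noisy_copies(proto_a, n_per_class),
               noisy_copies(proto_b, n_per_class)])
y = np.array(["a"] * n_per_class + ["b"] * n_per_class)

# Illustrative grid: search C and gamma on a logarithmic scale.
param_grid = {"C": [0.1, 1, 10, 100],
              "gamma": [1e-4, 1e-3, 1e-2, 1e-1]}

# 5-fold cross-validation over every (C, gamma) pair.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

With the real data, you would pass the training vectors and labels built in train_classifier() to search.fit() and then predict with search.best_estimator_.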

Andreas Mueller