Getting Weird Results From Sklearn SVM Models

Question

I am trying to train a support vector machine with scikit-learn, but I don't seem to be getting meaningful results, and I'm wondering if any SVM or scikit-learn experts out there might know why. Here is the example I am running. I have some handwritten digit data and I want to train a classifier to distinguish 'a' from 'b'. The data I used is here so you can test it out too. Both files, the training file and the test file, are in that archive. Any help understanding the results (the SVM says everything is an 'a') would be greatly appreciated.

Here is my script:

#!/usr/bin/env python

import os
import re
from sklearn import svm

def get_record(line):
    match = re.search("^(\S+) (\d+)", line)
    label = match.group(1)
    vector = list(match.group(2))
    vector = [int(x) for x in vector]
    return label, vector

def train_classifier():
    classifier = svm.SVC()
    data = open("sd19-train-binary.txt", "r")
    labels = []
    training_data = [] 
    i = 0
    for line in data:
        label, vector = get_record(line) 
        if label == 'a' or label == 'b': 
            labels.append(label)
            training_data.append(vector)
            i += 1
            if i > 100:
                break
    classifier.fit(training_data, labels) 
    return classifier

def test_classifier(classifier):
    data = open("sd19-test-binary.txt", "r")
    i = 0
    for line in data:
        label, vector = get_record(line)
        if label == 'a' or label == 'b':
            print(label, classifier.predict([vector])[0])
            i += 1
            if i > 100:
                break

def main():
    classifier = train_classifier()
    test_classifier(classifier)


main()

Do any elements get picked up in `training_data`? It looks like there might be a problem in `re.search("^(\S+) (\d+)", line)` - the regex should probably be `r"^(\S+) (\d+)"`. — user1149913, Apr 07 '13 at 19:02
Yep they do. If you want you can try the code and print the values. — David Williams, Apr 07 '13 at 19:13
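A quick way to check the concern raised in the comments is to run the parser on a single line. In this particular pattern, `\S` and `\d` are not string escapes that Python recognizes, so the plain and raw spellings produce the same regex; the raw string is still the safer habit. The sample line below is made up for illustration and only assumes the `label digits` layout that the regex implies.

```python
import re

def get_record(line):
    # Raw string so the backslashes reach the regex engine untouched.
    match = re.search(r"^(\S+) (\d+)", line)
    label = match.group(1)
    # Each character of the digit string becomes one integer feature.
    vector = [int(x) for x in match.group(2)]
    return label, vector

# Hypothetical input line, not taken from the real data files.
sample = "a 0110100\n"
label, vector = get_record(sample)
print(label, vector)
```

If the printed label and vector look right, parsing is unlikely to be the cause of the constant predictions.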

score 2 · Accepted Answer · answered Apr 08 '13 at 12:44

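By default, SVC uses an RBF kernel. Without setting gamma and C, ideally by cross-validation, you cannot expect meaningful results; with poorly chosen values the model can easily end up predicting a single class for every input, which is the behavior described in the question.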
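As a minimal sketch of what cross-validating gamma and C could look like, the snippet below runs a small grid search with `GridSearchCV`. The grid values, the helper name `tune_rbf_svc`, and the synthetic binary data are all assumptions made for illustration; with the real files you would pass the parsed `training_data` and `labels` instead, and probably use more than 100 samples.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def tune_rbf_svc(X, y, cv=3):
    """Cross-validate C and gamma for an RBF SVC; return the fitted search."""
    # Illustrative grid only; sensible ranges depend on the data.
    param_grid = {
        "C": [0.1, 1, 10, 100],
        "gamma": [1e-4, 1e-3, 1e-2, 1e-1],
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv)
    search.fit(X, y)
    return search

# Synthetic stand-in for the 'a'/'b' data: two random binary prototypes,
# each copied 50 times with about 10% of the bits flipped.
rng = np.random.RandomState(0)
n_per_class, n_features = 50, 64
proto_a = rng.randint(0, 2, n_features)
proto_b = rng.randint(0, 2, n_features)

def noisy_copies(proto, n):
    flips = rng.rand(n, proto.size) < 0.1
    return np.where(flips, 1 - proto, proto)

X = np.vstack([noisy_copies(proto_a, n_per_class),
               noisy_copies(proto_b, n_per_class)])
y = np.array(["a"] * n_per_class + ["b"] * n_per_class)

search = tune_rbf_svc(X, y)
print(search.best_params_, search.best_score_)
```

The fitted `search` object can be used like a classifier (`search.predict(...)`), since by default it refits the best parameter combination on all of the training data.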

Andreas Mueller