
I am trying to build a simple SVM document classifier using scikit-learn, and I am using the following code:

```python
import os
import numpy as np
import scipy.sparse as sp
from sklearn.metrics import accuracy_score
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import cross_validation
from sklearn.datasets import load_svmlight_file

clf = svm.SVC()

path = "C:\\Python27"

f1 = []
f2 = []
data2 = ['omg this is not a ship lol']

f = open(path + '\\mydata\\ACQ\\acqtot', 'r')
f = f.read()
f1 = f.split(';', 1085)

for i in range(0, 1086):
    f2.append('acq')

f1.append('shipping ship')
f2.append('crude')

vectorizer = TfidfVectorizer(min_df=1)
counter = CountVectorizer(min_df=1)

x_train = vectorizer.fit_transform(f1)
x_test = vectorizer.fit_transform(data2)

num_sample, num_features = x_train.shape
test_sample, test_features = x_test.shape

print("#samples: %d, #features: %d" % (num_sample, num_features))   # samples: 5, features: 25
print("#samples: %d, #features: %d" % (test_sample, test_features)) # samples: 2, features: 37

y = ['acq', 'crude']

# print x_test.n_features

clf.fit(x_train, f2)

# den = clf.score(x_test, y)
clf.predict(x_test)
```

It gives the following error:

```
    (n_features, self.shape_fit_[1]))
ValueError: X.shape[1] = 6 should be equal to 9451, the number of features at training time
```

What I don't understand is why it expects the number of features to be the same. If I am giving the model absolutely new text data to predict, it's obviously not possible for every document to have the same number of features as the data it was trained on. Do we have to explicitly set the number of features of the test data to 9451 in this case?

finitenessofinfinity
  • Possible duplicate of [Scikit learn - Random Forest Classifier](http://stackoverflow.com/questions/21998008/scikit-learn-random-forest-classifier) and several earlier questions. – Fred Foo Mar 24 '14 at 09:00
  • @larsmans "Several earlier questions"... like? The one question you did mention led to one of the answers, which happens to be the same as this one, but the issue described by the OP was different. The same answer doesn't mean the question is the same. – finitenessofinfinity Mar 24 '14 at 09:08
  • I couldn't find those earlier questions, but I know I've given the answer "use `transform`, not `fit_transform`" before. The issue is practically the same, use of `fit_transform` on the test set. – Fred Foo Mar 24 '14 at 09:11
  • @larsmans Either mention those questions you've answered while marking as duplicate or stop going around flagging other person's question. – finitenessofinfinity Mar 24 '14 at 09:13
  • [Here's one](http://stackoverflow.com/q/15422487/166749), though not answered by me specifically. You seem to be taking this as a personal attack, which it's not. – Fred Foo Mar 24 '14 at 09:22
  • @larsmans You could have just mentioned this one before. Thanks. I am not taking it personally, I've just seen some very over-zealous people on here who love to go around flagging questions without offering any explanations. Thanks again. – finitenessofinfinity Mar 24 '14 at 09:25
  • It's just that I had to search for it, which honestly is a pain on this website, but you're right that I should prove my claims. Sorry about that. I'm trying to avoid duplicate questions/answers pertaining to scikit-learn, because having the question at least *linked* makes it easier to find them :) – Fred Foo Mar 24 '14 at 10:04

3 Answers


To ensure that you have the same feature representation, you should not fit_transform your test data, but only transform it.

```python
x_train = vectorizer.fit_transform(f1)
x_test = vectorizer.transform(data2)
```

A similar transformation into homogeneous features should be applied to your labels.
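A minimal, self-contained sketch of why this works (the document strings here are invented for illustration): fitting the vectorizer once on the training set fixes the vocabulary, and `transform` then maps any new document into that same space, with unseen words simply getting zero weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=1)

train_docs = ["acquisition of the company", "shipping ship crude oil"]
test_docs = ["omg this is not a ship lol"]

# fit_transform learns the vocabulary from the training documents...
x_train = vectorizer.fit_transform(train_docs)
# ...and transform reuses that vocabulary; words not seen in training map to zero
x_test = vectorizer.transform(test_docs)

# Both matrices have the same number of columns (features)
print(x_train.shape[1], x_test.shape[1])
```

Because the column count is identical, a classifier fitted on `x_train` can predict on `x_test` without a shape mismatch.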

ssoler
emiguevara
    To add some conceptual understanding to this answer: the test *is* supposed to have the same number of features as the training set; the words that don't occur in it simply have a value of zero. – Fred Foo Mar 24 '14 at 09:01

SVM works by assuming all of your training data lives in an n-dimensional space and then performing a kind of geometric optimization on that set. To make that concrete, if n=2 then SVM is picking a line which optimally separates the (+) examples from the (-) examples.

What this means is that the result of training an SVM is tied to the dimensionality it was trained in. This dimensionality is exactly the size of your feature set (modulo kernels and other transformations, but in any case all of that information together uniquely sets the problem space). You thus cannot just apply this trained model to new data which exists in a space of a different dimensionality.

(You might suggest that we project or embed the training space into the test space, and that might even work in some circumstances, but it is invalid in general.)

This situation gets even trickier when you really analyze it, though. Not only does the test data dimensionality need to correspond with the training data dimensionality but the meaning of each dimension needs to be constant. For instance, back in our n=2 example, assume that we're classifying people's moods (happy/sad) and the x dimension is "enjoyment of life" and the y dimension is "time spent listening to sad music". We'd expect that greater x and lesser y values improve the likelihood of being happy, so a good discrimination boundary that SVM could find would be the y=x line as people closer to the x axis tend to be happy and closer to the y axis tend to be sad.

But then let's say someone blunders and mixes up the x and y dimensions when they feed in the test data. Boom: suddenly you've got an incredibly inaccurate predictor.
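The mood example can be sketched concretely with scikit-learn (the two clusters, their means, and the test point are all invented for illustration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# x = "enjoyment of life", y = "time spent listening to sad music"
happy = rng.normal([3.0, 1.0], 0.5, size=(50, 2))  # high x, low y
sad = rng.normal([1.0, 3.0], 0.5, size=(50, 2))    # low x, high y
X = np.vstack([happy, sad])
y = ["happy"] * 50 + ["sad"] * 50

clf = SVC(kernel="linear").fit(X, y)

point = np.array([[3.2, 0.8]])       # deep in the "happy" region
print(clf.predict(point))            # lands on the "happy" side
print(clf.predict(point[:, ::-1]))   # same numbers, columns swapped: now "sad"
```

The numbers themselves never changed; only the meaning assigned to each column did, and that alone flips the prediction.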


So in particular, the observation space of the test data must match the observation space of the training data. Matching dimensionality is a necessary condition, but the match must actually be exact: each feature must mean the same thing in both.

Which is a long way of saying that you need to either do some feature engineering or find an algorithm without this kind of dependency (which will also involve some feature engineering).

J. Abrahamson

> Do we have to explicitly set the no of features of the test data to be equal to 9451 in this case?

Yes, you do. The SVM needs to see the same dimensionality as the training set. When working with documents, people typically use a bag-of-words representation and restrict the vocabulary, for example to the x most frequent words.
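One common way to cap the vocabulary is `CountVectorizer`'s `max_features` parameter, which keeps only the most frequent terms; a small sketch with invented documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["oil acquisition deal", "crude oil shipping", "acquisition of shares"]

# Cap the bag-of-words vocabulary at the 4 most frequent terms
vec = CountVectorizer(max_features=4)
x_train = vec.fit_transform(train_docs)
x_test = vec.transform(["a brand new oil document"])

# Both matrices have exactly max_features columns
print(x_train.shape, x_test.shape)
```

Every new document, no matter how long, is mapped into the same fixed-size feature space, which is exactly what the SVM requires.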

Pedrom