I am trying to build a simple SVM document classifier using scikit-learn, and I am using the following code:
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
clf = svm.SVC()
path = "C:\\Python27"
data2 = ['omg this is not a ship lol']   # the new document I want to classify

# The training corpus is one file of documents separated by ';'
with open(path + '\\mydata\\ACQ\\acqtot', 'r') as f:
    text = f.read()
f1 = text.split(';', 1085)               # up to 1086 'acq' documents

f2 = []
for i in range(0, 1086):
    f2.append('acq')                     # one label per document

f1.append('shipping ship')               # one extra document
f2.append('crude')                       # and its label
vectorizer = TfidfVectorizer(min_df=1)
x_train = vectorizer.fit_transform(f1)   # fits the vocabulary on the training documents
x_test = vectorizer.fit_transform(data2) # fits the vocabulary again, on the test text
num_sample, num_features = x_train.shape
test_sample, test_features = x_test.shape
print("#samples: %d, #features: %d" % (num_sample, num_features))    # samples: 1087, features: 9451
print("#samples: %d, #features: %d" % (test_sample, test_features))  # samples: 1, features: 6
y = ['acq', 'crude']
clf.fit(x_train, f2)                     # train the SVM on the 1087 documents
#den = clf.score(x_test, y)
clf.predict(x_test)                      # this line raises the error below
It gives the following error:
(n_features, self.shape_fit_[1]))
ValueError: X.shape[1] = 6 should be equal to 9451, the number of features at training time
What I don't understand is why it expects the number of features to be the same. If I give the classifier completely new text to predict, it is obviously not possible for every document to contain exactly the same features as the documents used to train it. Do we have to explicitly set the number of features of the test data to 9451 in this case?
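
To make the problem easier to reproduce, here is a minimal self-contained sketch (the document strings are made up for illustration) that fails the same way when fit_transform is called a second time on the test text:

from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up training corpus, same 'acq'/'crude' labels as above
train_docs = ['the company acquired a rival firm', 'crude oil shipping rates rose']
train_labels = ['acq', 'crude']

vectorizer = TfidfVectorizer(min_df=1)
x_train = vectorizer.fit_transform(train_docs)    # learns a vocabulary from the training docs

# calling fit_transform again re-fits the vocabulary on the test text alone,
# so x_test ends up with a different number of columns than x_train
x_test = vectorizer.fit_transform(['omg this is not a ship lol'])

print(x_train.shape, x_test.shape)                # (2, 10) vs (1, 6): feature counts differ

clf = svm.SVC()
clf.fit(x_train, train_labels)
clf.predict(x_test)                               # raises the same ValueError

Running this sketch prints two different column counts and then fails on predict, so the mismatch seems to come from the second fit_transform call re-learning the vocabulary.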