
I am working on structuring text data.

I need to make predictions in the following format:

xyz@gmail.com -> Email, India -> Country, etc.

To achieve that, I am using SVC together with OneVsRestClassifier. Prediction works fine when the training and test code are in the same script.

However, prediction fails when the two steps are run separately, i.e. when the training and test code are in separate Python scripts.

The error I receive points to a mismatch in the sparse matrix dimensions.

Please help me resolve this mismatch.


Sample Trainer Module

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

#Tfidf vectorizer for text data

tfidf_enc = TfidfVectorizer(binary=True)
lbl_enc = LabelEncoder()

X = tfidf_enc.fit_transform(name_text)
X = X.astype('float16')

y = lbl_enc.fit_transform(name_label)
clf = SVC(C=100, kernel='rbf', degree=3,
          gamma=1, coef0=1, shrinking=True, 
          probability=True, tol=0.001, cache_size=200,
          class_weight=None, verbose=2, max_iter=-1,
          decision_function_shape=None, random_state=None) 
model = OneVsRestClassifier(clf, n_jobs=4)
model.fit(X,y)

import pickle
# save the model to disk
filename = 'D:/authAff_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)

Sample Testing Module

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle
# load the model from disk
filename = 'D:/authAff_model.sav'
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

#Prediction
test_as_text = ['France','xyz@gmail.com','Singapore']
test_as_text = [item.lower() for item in test_as_text]

tfidf_enc = TfidfVectorizer(binary=True)

X_test = tfidf_enc.fit_transform(test_as_text)
X_test = X_test.astype('float16')
y_test = loaded_model.predict(X_test)

Error Message

When testing as a separate script the following error occurs:

ValueError: X.shape[1] = 6 should be equal to 6104, the number of features at training time

Original dimensionality:

<3x6104 sparse matrix of type '<type 'numpy.float16'>'
    with 3 stored elements in Compressed Sparse Row format>
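
The mismatch can be reproduced in isolation. The following is a minimal sketch, with made-up toy strings standing in for the real training data (the lists and variable names here are assumptions, not the original data). A TfidfVectorizer fitted on the test strings alone learns its own small vocabulary, so its output has a different number of columns than the matrix the model was trained on, while calling transform() on the already fitted vectorizer keeps the training-time columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the real training and test data (hypothetical values).
train_text = ["india", "france", "singapore", "xyz@gmail.com", "abc@yahoo.com"]
test_text = ["france", "xyz@gmail.com", "singapore"]

# Vectorizer fitted on the training data: its vocabulary fixes the column count.
train_vec = TfidfVectorizer(binary=True)
X_train = train_vec.fit_transform(train_text)

# Fitting a fresh vectorizer on the test strings builds a new, smaller vocabulary.
refit_vec = TfidfVectorizer(binary=True)
X_refit = refit_vec.fit_transform(test_text)

# Reusing the fitted vectorizer maps the test strings onto the training columns.
X_reused = train_vec.transform(test_text)

print("train:", X_train.shape)
print("refit on test:", X_refit.shape)
print("reused vectorizer:", X_reused.shape)
```

Only the last matrix has the column count the trained model expects.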
RAMASWAMY M
  • Possible duplicate of [Python-Scikit. Training and testing data using SVM](https://stackoverflow.com/questions/42029956/python-scikit-training-and-testing-data-using-svm) – Vivek Kumar Aug 17 '17 at 13:27
  • See the above question, which describes the same problem with almost the same answer from me, just using `joblib` instead of pickle because it's the [recommended way](http://scikit-learn.org/stable/modules/model_persistence.html) for scikit-learn. The error is that you are using different TfidfVectorizers for training and testing. Save the original `TfidfVectorizer` with the model, load it in the testing module, and call only `X_test = loaded_tfidf.transform(test_as_text)`, not `fit_transform()`. – Vivek Kumar Aug 17 '17 at 13:31
  • @VivekKumar Thank you so much for your guidance :-) – RAMASWAMY M Aug 17 '17 at 16:09
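
The suggestion in the comments can be sketched roughly as follows. This is an illustration under stated assumptions rather than the asker's final code: the toy `name_text` / `name_label` lists, the temporary file path, and the dictionary keys are made up, the SVC settings are trimmed, and the standalone `joblib` package is assumed to be installed. The training half saves the fitted vectorizer and label encoder together with the model; the testing half loads them and calls only `transform()`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Toy stand-ins for name_text / name_label (hypothetical values).
name_text = ["india", "france", "singapore",
             "xyz@gmail.com", "abc@yahoo.com", "joe@gmail.com",
             "january", "march", "october"]
name_label = ["Country", "Country", "Country",
              "Email", "Email", "Email",
              "Month", "Month", "Month"]

# --- training script ---
tfidf_enc = TfidfVectorizer(binary=True)
lbl_enc = LabelEncoder()
X = tfidf_enc.fit_transform(name_text)
y = lbl_enc.fit_transform(name_label)
model = OneVsRestClassifier(SVC(C=100, kernel="rbf", gamma=1))
model.fit(X, y)

# Persist all three fitted objects together (path is a placeholder).
filename = os.path.join(tempfile.mkdtemp(), "authAff_model.joblib")
joblib.dump({"model": model, "tfidf": tfidf_enc, "labels": lbl_enc}, filename)

# --- testing script ---
bundle = joblib.load(filename)
test_as_text = [item.lower() for item in ["France", "xyz@gmail.com", "Singapore"]]

# transform() only: reuse the training vocabulary instead of refitting.
X_test = bundle["tfidf"].transform(test_as_text)
y_test = bundle["model"].predict(X_test)

# Map the integer class ids back to the original label strings.
predicted = bundle["labels"].inverse_transform(y_test)
print(dict(zip(test_as_text, predicted)))
```

Saving the label encoder as well is a design choice of this sketch: without it, the testing script would only see integer class ids and could not recover names such as Email or Country.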
