I am working on structuring text data.
I need to make predictions in the following format:
xyz@gmail.com -> Email, India -> Country, etc.
To achieve that, I am using SVC together with OneVsRestClassifier. Prediction works fine when the training and testing code are in the same script.
However, prediction fails when they are run separately, i.e. when training and testing are in separate Python scripts.
The error I receive points to a mismatch in the sparse matrix dimensions.
Please help me resolve this mismatch.
Sample Trainer Module
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
#Tfidf vectorizer for text data
tfidf_enc = TfidfVectorizer(binary=True)
lbl_enc = LabelEncoder()
X = tfidf_enc.fit_transform(name_text)
X = X.astype('float16')
y = lbl_enc.fit_transform(name_label)
clf = SVC(C=100, kernel='rbf', degree=3,
          gamma=1, coef0=1, shrinking=True,
          probability=True, tol=0.001, cache_size=200,
          class_weight=None, verbose=2, max_iter=-1,
          decision_function_shape=None, random_state=None)
model = OneVsRestClassifier(clf, n_jobs=4)
model.fit(X,y)
import pickle
# save the model to disk
filename = 'D:/authAff_model.sav'
pickle.dump(model, open(filename, 'wb'))
Sample Testing Module
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle
# load the model from disk
filename = 'D:/authAff_model.sav'
loaded_model = pickle.load(open(filename, 'rb'))
#Prediction
test_as_text = ['France','xyz@gmail.com','Singapore']
test_as_text = [item.lower() for item in test_as_text]
tfidf_enc = TfidfVectorizer(binary=True)
X_test = tfidf_enc.fit_transform(test_as_text)
X_test = X_test.astype('float16')
y_test = loaded_model.predict(X_test)
Error Message
When testing from a separate script, the following error occurs:
ValueError: X.shape[1] = 6 should be equal to 6104, the number of features at training time
Original dimensionality (when training and testing run in the same script):
<3x6104 sparse matrix of type '<type 'numpy.float16'>'
with 3 stored elements in Compressed Sparse Row format>
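
The mismatch can be reproduced on a toy corpus. The sketch below is a minimal illustration, not the original pipeline: the names train_text, test_text, train_enc, refit_enc and restored_enc, and the sample strings, are assumptions made up for the example. It suggests that a TfidfVectorizer fitted afresh on the test strings learns its own, smaller vocabulary, whereas calling transform on the vectorizer that was fitted at training time (here round-tripped through pickle in memory, as it might be if it were saved alongside the model) keeps the training-time column count.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for name_text and the test strings.
train_text = ["xyz@gmail.com", "india", "france", "abc@yahoo.com"]
test_text = ["france", "xyz@gmail.com", "singapore"]

# Training side: fit the vectorizer once on the training text.
train_enc = TfidfVectorizer(binary=True)
X_train = train_enc.fit_transform(train_text)

# Testing side, variant 1: a brand-new vectorizer fitted on the test text
# builds a vocabulary from the test strings only.
refit_enc = TfidfVectorizer(binary=True)
X_refit = refit_enc.fit_transform(test_text)

# Testing side, variant 2: restore the training-time vectorizer (an in-memory
# pickle round trip stands in for saving and loading a file) and only
# call transform, so the columns match the training vocabulary.
restored_enc = pickle.loads(pickle.dumps(train_enc))
X_reused = restored_enc.transform(test_text)

print("train features:   ", X_train.shape[1])
print("refit on test:    ", X_refit.shape[1])
print("reused vectorizer:", X_reused.shape[1])
```

If this is what is happening here, saving the fitted vectorizer next to the model and calling transform instead of fit_transform in the testing script would be one thing to try.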