I am training a model to classify emails as spam or ham, and I select features like this:
    t = TfidfVectorizer(max_features=num_feature)
    t.fit_transform(spam_corpus)
    spam_features = t.get_feature_names()
    t.fit_transform(ham_corpus)
    ham_features = t.get_feature_names()
    joblib.dump(t, './output/tfidf.pkl')
    return spam_features + ham_features
The feature space contains both ham and spam features. I save the fitted Tfidf model so that I can later use it to predict a totally new, separate email. However, for this new email only half the number of features is created (because I am not adding the spam and ham features together at that point), and therefore the SVM classifier cannot predict anything.
What is the best way to deal with this, so that the trained Tfidf model and the new email have the same number of features?
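One possible direction, as a minimal sketch rather than a definitive fix: calling `fit_transform` a second time replaces the vocabulary learned on the first call, so the pickled vectorizer only knows the ham terms. The sketch below instead collects the per-class terms, takes their union, and fits a single `TfidfVectorizer` with that fixed `vocabulary`. The toy corpora, the helper `top_terms`, and the use of `LinearSVC` are assumptions made for illustration; they are not part of the code above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for spam_corpus and ham_corpus.
spam_corpus = ["win free money now", "free prize claim now"]
ham_corpus = ["meeting moved to monday", "lunch on monday at noon"]
num_feature = 5


def top_terms(corpus, k):
    """Return the terms a TfidfVectorizer keeps when fitted on corpus alone."""
    vectorizer = TfidfVectorizer(max_features=k)
    vectorizer.fit(corpus)
    return set(vectorizer.vocabulary_)


# Union of the per-class vocabularies, sorted so the column order is stable.
vocabulary = sorted(
    top_terms(spam_corpus, num_feature) | top_terms(ham_corpus, num_feature)
)

# A single vectorizer with a fixed vocabulary, fitted once on all training mail.
tfidf = TfidfVectorizer(vocabulary=vocabulary)
X_train = tfidf.fit_transform(spam_corpus + ham_corpus)
y_train = [1] * len(spam_corpus) + [0] * len(ham_corpus)  # 1 = spam, 0 = ham

clf = LinearSVC().fit(X_train, y_train)

# For a new email, only transform (never refit), so the width matches training.
X_new = tfidf.transform(["claim your free prize"])
prediction = clf.predict(X_new)
```

Under this approach, `tfidf` is the object that would be saved with `joblib.dump` and reloaded at prediction time, since it is the one whose columns the classifier was trained on. Words in the new email that are not in `vocabulary` are simply ignored by `transform`.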