Best way of matching the feature spaces in this classification problem using Tfidf and SVM

Question

I am training a model to classify emails as spam or ham, and I select features like this:

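import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

t = TfidfVectorizer(max_features=num_feature)

# Fit on the spam corpus and collect its terms
t.fit_transform(spam_corpus)
spam_features = t.get_feature_names()

# Fit the same vectorizer on the ham corpus and collect its terms
t.fit_transform(ham_corpus)
ham_features = t.get_feature_names()

# Save the vectorizer for use at prediction time
joblib.dump(t, './output/tfidf.pkl')

return spam_features + ham_features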

The feature space contains both ham and spam features. I save the fitted Tfidf model so that it can later be used to predict a completely new, separate email. But for this new email only half the number of features is created (because I am not adding spam + ham), and therefore the SVM classifier cannot predict anything.

What is the best way of dealing with this, such that I have an equal number of features on the trained Tfidf model AND the new email?

Please don't edit the answer into your question. You can [self-answer your question](https://stackoverflow.com/help/self-answer) so other people can see you've already figured the solution out, or others can easily find the answer if they have the same problem. — Mihai Chelaru, Apr 23 '20 at 22:49

score -1 · Answer 1 · answered Apr 23 '20 at 23:43

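I didn't realise that calling fit_transform a second time on the same vectorizer completely replaces the previous fit, so the saved model only knew the ham vocabulary. I just had to fit a separate vectorizer for each corpus and save both separately.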
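
A minimal sketch of what that could look like, assuming one vectorizer per class whose outputs are concatenated column-wise with scipy's hstack. The toy corpora, the featurize helper, the file names, and the choice of SVC with a linear kernel are illustrative assumptions, not taken from the original code:

```python
import os
import tempfile

import joblib
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Toy corpora, purely illustrative.
spam_corpus = ["win free money now", "free prize claim now", "cheap loans win big"]
ham_corpus = ["meeting at noon tomorrow", "please review the attached report",
              "lunch tomorrow at noon"]
num_feature = 5  # assumed cap per vectorizer

# One vectorizer per class, so the second fit cannot overwrite the first.
spam_tfidf = TfidfVectorizer(max_features=num_feature).fit(spam_corpus)
ham_tfidf = TfidfVectorizer(max_features=num_feature).fit(ham_corpus)


def featurize(texts, spam_vec, ham_vec):
    """Hypothetical helper: spam-vocabulary columns followed by ham-vocabulary columns."""
    return hstack([spam_vec.transform(texts), ham_vec.transform(texts)]).tocsr()


# Training uses the combined feature space.
X_train = featurize(spam_corpus + ham_corpus, spam_tfidf, ham_tfidf)
y_train = [1] * len(spam_corpus) + [0] * len(ham_corpus)
clf = SVC(kernel="linear").fit(X_train, y_train)

# Save both fitted vectorizers separately (a temp dir stands in for ./output).
out_dir = tempfile.mkdtemp()
joblib.dump(spam_tfidf, os.path.join(out_dir, "spam_tfidf.pkl"))
joblib.dump(ham_tfidf, os.path.join(out_dir, "ham_tfidf.pkl"))

# Prediction time: load both and only transform (never refit) the new email.
spam_loaded = joblib.load(os.path.join(out_dir, "spam_tfidf.pkl"))
ham_loaded = joblib.load(os.path.join(out_dir, "ham_tfidf.pkl"))
X_new = featurize(["claim your free prize"], spam_loaded, ham_loaded)
prediction = clf.predict(X_new)
```

The new email ends up with the same number of columns as the training matrix because it goes through the same two fitted vocabularies, in the same order. Another option, if the per-class vocabularies are not needed, would be to fit a single vectorizer on the combined spam and ham corpus and save just that one.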

Sid Jones
