
I am training a model to classify emails as spam or ham, and I select features like this:

t = TfidfVectorizer(max_features=num_feature)

t.fit_transform(spam_corpus)
spam_features = t.get_feature_names()

t.fit_transform(ham_corpus)
ham_features = t.get_feature_names()

joblib.dump(t, './output/tfidf.pkl')

return spam_features + ham_features

The feature space is meant to contain both the ham and the spam features. I save the fitted Tfidf model so that it can later be used to transform a totally new, separate email. But for this new email only half the number of features is created (because the saved model does not hold spam + ham), and therefore the SVM classifier cannot predict anything.

What is the best way to deal with this, so that the trained Tfidf model and the new email produce the same number of features?

Sid Jones
  • Please don't edit the answer into your question. You can [self-answer your question](https://stackoverflow.com/help/self-answer) so other people can see you've already figured the solution out, or others can easily find the answer if they have the same problem. – Mihai Chelaru Apr 23 '20 at 22:49

1 Answer


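I didn't realise that calling fit_transform a second time on the same TfidfVectorizer totally replaces what the previous call learned, so the pickled object only knew the ham vocabulary. I just had to fit and save the two vectorizers separately.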
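For anyone landing here, a minimal sketch of what "save both separately" could look like. The toy corpora, the `featurize` helper, the file names and the temporary directory are all made up for illustration, and it assumes the classifier is trained on the spam columns and ham columns placed side by side; it is not necessarily the exact code I ended up with.

```python
import os
import tempfile

import joblib
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, invented for illustration only.
spam_corpus = ["win free money now", "free prize claim now"]
ham_corpus = ["meeting at noon tomorrow", "see you at lunch tomorrow"]
num_feature = 5  # hypothetical cap per vectorizer

# The original problem: refitting one vectorizer discards the first vocabulary.
t = TfidfVectorizer(max_features=num_feature)
t.fit(spam_corpus)
t.fit(ham_corpus)  # t now only knows the ham vocabulary

# One vectorizer per class, so neither fit can overwrite the other.
spam_vec = TfidfVectorizer(max_features=num_feature).fit(spam_corpus)
ham_vec = TfidfVectorizer(max_features=num_feature).fit(ham_corpus)

# Persist both fitted vectorizers (a temp dir stands in for ./output).
out_dir = tempfile.mkdtemp()
spam_path = os.path.join(out_dir, "tfidf_spam.pkl")
ham_path = os.path.join(out_dir, "tfidf_ham.pkl")
joblib.dump(spam_vec, spam_path)
joblib.dump(ham_vec, ham_path)


def featurize(texts, spam_vectorizer, ham_vectorizer):
    """Place the spam-vocabulary and ham-vocabulary TF-IDF columns side by side."""
    return hstack(
        [spam_vectorizer.transform(texts), ham_vectorizer.transform(texts)]
    ).tocsr()


# Training matrix, built with the freshly fitted vectorizers.
X_train = featurize(spam_corpus + ham_corpus, spam_vec, ham_vec)

# Prediction time: reload both and transform the new email the same way.
spam_loaded = joblib.load(spam_path)
ham_loaded = joblib.load(ham_path)
X_new = featurize(["claim your free lunch"], spam_loaded, ham_loaded)

print(X_train.shape, X_new.shape)
```

Because the new email goes through the same two fitted vectorizers with `transform` (never `fit_transform`), it always ends up with the same number of columns as the training matrix. Words that appear in both vocabularies get two columns here; fitting a single vectorizer on the combined corpus would be an alternative if that duplication is unwanted.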

Sid Jones