I'm doing some text classification tasks. What I've observed is that when fed a tf-idf matrix (from sklearn's TfidfVectorizer), a LogisticRegression model always outperforms a MultinomialNB model. Below is my code for training both:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

X = df_new['text_content']
y = df_new['label']
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the vectorizer on the training split only, then transform both splits.
vectorizer = TfidfVectorizer(stop_words='english')
X_train_dtm = vectorizer.fit_transform(X_train)
X_test_dtm = vectorizer.transform(X_test)

# Logistic regression on the tf-idf document-term matrix.
clf_lr = LogisticRegression()
clf_lr.fit(X_train_dtm, y_train)
y_pred = clf_lr.predict(X_test_dtm)
lr_score = accuracy_score(y_test, y_pred)  # perfectly balanced binary classes

# Multinomial naive Bayes on the same tf-idf matrix.
clf_mnb = MultinomialNB()
clf_mnb.fit(X_train_dtm, y_train)
y_pred = clf_mnb.predict(X_test_dtm)
mnb_score = accuracy_score(y_test, y_pred)  # perfectly balanced binary classes
Currently lr_score > mnb_score always. I'm wondering how exactly MultinomialNB uses the tf-idf matrix, since the term frequencies in tf-idf are computed without any class information. Is there any chance I shouldn't feed the tf-idf matrix to MultinomialNB the same way I do to LogisticRegression?
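For context, here's the kind of side-by-side check I have in mind, reusing the train/test split from above; CountVectorizer gives MultinomialNB the integer counts its model nominally assumes (the variable names here are my own, just for illustration):

# Compare MultinomialNB on raw term counts vs. the tf-idf matrix above.
from sklearn.feature_extraction.text import CountVectorizer

count_vec = CountVectorizer(stop_words='english')
X_train_counts = count_vec.fit_transform(X_train)  # integer term counts
X_test_counts = count_vec.transform(X_test)

clf_mnb_counts = MultinomialNB()
clf_mnb_counts.fit(X_train_counts, y_train)
count_score = accuracy_score(y_test, clf_mnb_counts.predict(X_test_counts))

print('MultinomialNB on counts:', count_score)
print('MultinomialNB on tf-idf:', mnb_score)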
Update: I understand the difference between the results of TfidfVectorizer and CountVectorizer. I also just checked the source code of sklearn's MultinomialNB.fit() function, and it looks like it does expect counts as opposed to frequencies. This would also explain the performance boost mentioned in my comment below. However, I'm still wondering whether passing tf-idf into MultinomialNB ever makes sense. The sklearn documentation briefly mentions the possibility, but without much detail.
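From my reading of the source, fit() essentially sums each term's feature values over all documents of a class to estimate P(term | class), so fractional tf-idf weights end up being treated as fractional pseudo-counts. Here's a rough sketch of that estimation step (my own simplification for illustration, not sklearn's actual code):

# A rough sketch of MultinomialNB's parameter estimation, simplified from
# my reading of the sklearn source; the function name and structure are mine.
import numpy as np

def mnb_fit_sketch(X, y, alpha=1.0):
    """Estimate log P(term | class) by summing feature values per class.
    With tf-idf input, the fractional weights act as fractional counts."""
    y = np.asarray(y)
    log_probs = {}
    for c in np.unique(y):
        # Sum each term's values over all documents of class c
        # (this corresponds to the feature_count_ step in sklearn).
        counts = np.asarray(X[y == c].sum(axis=0)).ravel()
        smoothed = counts + alpha  # Laplace/Lidstone smoothing
        log_probs[c] = np.log(smoothed / smoothed.sum())
    return log_probs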
Any advice would be much appreciated!