X has 4211 features, but GaussianNB is expecting 8687 features as input

Question

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

s_df=pd.read_csv('Sarcasm Dataset.csv')
s_df.rename({"Unnamed: 0":"number"}, axis="columns", inplace=True)

sarc_classify = s_df.drop(['number','sarcasm','irony','satire','understatement','overstatement','rhetorical_question'],axis=1)


X_train, X_test, y_train, y_test = train_test_split(sarc_classify['tweet'], sarc_classify['sarcastic'])

vectorizer = CountVectorizer()

X1=vectorizer.fit_transform(X_train.values.astype('U'))
X_train=X1.toarray()

X2=vectorizer.fit_transform(X_test.values.astype('U'))
X_test=np.array(X2.todense())

gnb =  GaussianNB()
naive_bayes = gnb.fit(X_train, y_train)
y_pred =gnb.predict(X_test)

I am getting the error below. Before vectorizing, X_train holds the raw tweets and y_train holds the sarcastic labels. All I want is to implement a basic Naive Bayes classifier using sklearn.

Error:

ValueError                                Traceback (most recent call last)
<ipython-input-243-52354d6c7ca6> in <module>()
      1 gnb =  GaussianNB()
      2 naive_bayes = gnb.fit(X_train, y_train)
----> 3 y_pred =gnb.predict(X_test)
      4 acc_score = accuracy_score(y_test, y_pred)
      5 print(acc_score)

3 frames
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in _check_n_features(self, X, reset)
    399         if n_features != self.n_features_in_:
    400             raise ValueError(
--> 401                 f"X has {n_features} features, but {self.__class__.__name__} "
    402                 f"is expecting {self.n_features_in_} features as input."
    403             )

ValueError: X has 1549 features, but GaussianNB is expecting 3298 features as input.

Your corpus will be, in general, different between `X_train` and `X_test`, so the dimension of the `CountVectorizer` output for each will be different. Perhaps you should `fit` on the combined corpus, and then transform each. — rickhg12hs, Mar 14 '22 at 00:35
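The effect described in this comment can be reproduced on a toy corpus. The sketch below is only an illustration: the documents and variable names are made up and are not taken from the asker's dataset. It shows that fitting a vectorizer separately on two sets of text yields matrices with different column counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, not from the sarcasm dataset.
train_docs = ["great movie", "terrible plot", "great acting"]
test_docs = ["great plot twist"]

# Fitting on the training text learns 5 distinct words.
train_matrix = CountVectorizer().fit_transform(train_docs)

# Fitting again on the test text learns a separate 3-word vocabulary.
refit_matrix = CountVectorizer().fit_transform(test_docs)

print(train_matrix.shape)  # (3, 5)
print(refit_matrix.shape)  # (1, 3)
```

A model trained on the 5-column matrix cannot score the 3-column one, which is the same mismatch reported in the traceback.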

Poornima Devi · Answer 1 · 2022-08-04T05:28:28.190

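This error occurs when you call fit_transform on both the training set and the test set with a TF-IDF or count vectorizer. Each fit_transform call learns a new vocabulary, so the two matrices end up with different numbers of columns. Instead, call fit_transform only on the training set and then just transform on the test set, as shown below.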

from sklearn.feature_extraction.text import CountVectorizer

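cv = CountVectorizer()

# Learn the vocabulary from the training text only, then encode it
cv_train = cv.fit_transform(X_train)
# Reuse that vocabulary to encode the test text (no refitting)
cv_test = cv.transform(X_test)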

This is done because the vocabulary (and, for TF-IDF, the document frequencies) must be learnt from the training set, which is then transformed into a document-term matrix. For the test set, the learnt vocabulary is only applied, so the test set is transformed into a document-term matrix with the same columns as the training matrix.
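Put together with the classifier from the question, the approach might look like the following minimal sketch. The texts and labels are hypothetical stand-ins for the tweet and sarcastic columns, and the variable names are chosen for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data standing in for the 'tweet' and 'sarcastic' columns.
train_texts = [
    "oh great another monday",
    "i love sunny days",
    "wow what a surprise",
    "the food was good",
]
train_labels = [1, 0, 1, 0]
test_texts = ["oh great more rain", "good food today"]

vectorizer = CountVectorizer()

# Fit on the training text only; GaussianNB needs a dense array.
X_train = vectorizer.fit_transform(train_texts).toarray()

# Only transform the test text, so the columns match the training matrix.
X_test = vectorizer.transform(test_texts).toarray()

gnb = GaussianNB()
gnb.fit(X_train, train_labels)
y_pred = gnb.predict(X_test)

print(X_train.shape[1] == X_test.shape[1])  # True
```

Words that appear only in the test text (here "more", "rain" and "today") are not in the learnt vocabulary, so transform simply ignores them.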

Reference :

https://towardsdatascience.com/training-a-naive-bayes-model-to-identify-the-author-of-an-email-or-document-17dc85fa630a
