I'm trying to build a positive/negative review classifier using multinomial naive Bayes (or regular naive Bayes). Without feature selection it works fine, but when I select features with SelectKBest and chi2, I get the following error:

Traceback (most recent call last):

  File "<ipython-input-176-a426973d76d1>", line 1, in <module>
    bayes_predict = bayes.predict(X_dev)

  File "/home/c1962135/.local/share/virtualenvs/c1962135-9R_1M4TP/lib/python3.6/site-packages/sklearn/naive_bayes.py", line 65, in predict
    jll = self._joint_log_likelihood(X)

  File "/home/c1962135/.local/share/virtualenvs/c1962135-9R_1M4TP/lib/python3.6/site-packages/sklearn/naive_bayes.py", line 737, in _joint_log_likelihood
    return (safe_sparse_dot(X, self.feature_log_prob_.T) +

  File "/home/c1962135/.local/share/virtualenvs/c1962135-9R_1M4TP/lib/python3.6/site-packages/sklearn/utils/extmath.py", line 142, in safe_sparse_dot
    return np.dot(a, b)

  File "<__array_function__ internals>", line 6, in dot

ValueError: shapes (5000,7001) and (4000,2) not aligned: 7001 (dim 1) != 4000 (dim 0)

I'll explain the structure of my code:

size(train_dataset) = (15000,4) 
size(dev_dataset) = (5000, 4)
size(test_dataset) = (5000,4)

They are all pandas DataFrames. I used three types of features (one with 5000 columns, one with 2000, and one with a single column), so the train, dev and test arrays have these shapes:

size(X_train) = (15000, 7001)
size(X_dev) = (5000, 7001)
size(X_test) = (5000, 7001)

For feature reduction, training and testing I use the following code:

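from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

chitest = SelectKBest(score_func=chi2, k=4000)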
chi = chitest.fit(X_train, Y_train)
X_train_new = chi.transform(X_train)

bayes = MultinomialNB()
bayes.fit(X_train_new,Y_train)

bayes_predict = bayes.predict(X_dev)
print(classification_report(Y_test_gold, bayes_predict))

And this gives me the error from before, but I really can't figure out why.

user12195705
  • Which line exactly gives you this error? – PeptideWitch Dec 18 '19 at 00:55
  • @PeptideWitch just edited the original post with that! – user12195705 Dec 18 '19 at 00:57
  • I think the problem may be in how you've split up your train/test sets. I can't really make sense of where the 7001-long arrays come from. sklearn has a really neat function called `train_test_split` - https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html. Consider using it to keep your sets consistent in shape – PeptideWitch Dec 18 '19 at 01:06
  • @PeptideWitch the length of the sets was already defined when I got them. The 7001 thing refers to: 5000 features extracted with TFIDF, 2000 features extracted with BOW and 1 feature that represents review length. I don't think the sets are wrong to begin with because both logit reg. and SVM work fine. – user12195705 Dec 18 '19 at 01:09
  • Reading through your code here, it seems that you've selected the best 4000 features of the `X_train` set to make the `X_train_new` set. After fitting with `bayes.fit(X_train_new,Y_train)`, the model expects 4000 features for its predictions. Instead, you're passing in the raw `X_dev` dataset, which contains 7001 features. You need to apply the same fitted selector to `X_dev` (`chi.transform(X_dev)`) so that the feature counts match. That's why you're getting the 7001/4000 mismatch error – PeptideWitch Dec 18 '19 at 02:40
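
The fix described in the last comment can be sketched roughly as follows. This is a minimal illustration, not the asker's actual pipeline: the random arrays are stand-ins (an assumption) for the real feature matrices, scaled down to 70 columns with k=40 instead of 7001 columns with k=4000, and `X_dev_new` is a name introduced here for the reduced dev set.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-ins for the real data (assumption). Values are non-negative,
# which both chi2 and MultinomialNB require.
rng = np.random.RandomState(0)
X_train = rng.randint(0, 5, size=(150, 70))
Y_train = rng.randint(0, 2, size=150)
X_dev = rng.randint(0, 5, size=(50, 70))

# Fit the selector on the training set only, then reduce the training set.
chitest = SelectKBest(score_func=chi2, k=40)
X_train_new = chitest.fit_transform(X_train, Y_train)

bayes = MultinomialNB()
bayes.fit(X_train_new, Y_train)

# Apply the SAME fitted selector to the dev set before predicting,
# so the model sees the same 40 columns it was trained on.
X_dev_new = chitest.transform(X_dev)
bayes_predict = bayes.predict(X_dev_new)

print(X_train_new.shape, X_dev_new.shape, bayes_predict.shape)
```

The selector is fitted once on the training data and only `transform` is called on the dev (and later test) data, so no label information from those sets leaks into feature selection. Wrapping the selector and classifier in a `sklearn.pipeline.Pipeline` would likely achieve the same thing with less bookkeeping.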
