I am trying to fit a MultinomialNB() model. I read a CSV into a dataframe (data), then tokenized and lemmatized the text in order to keep the most used words. The code for the model is this:

from sklearn.feature_extraction.text import CountVectorizer

max_features = 5000
count_vectorizer = CountVectorizer(max_features=max_features, stop_words="english")
sparce_matrix = count_vectorizer.fit_transform(Tweet_list).toarray()
y = data.iloc[:,0].values
x = sparce_matrix

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.1)

from sklearn.naive_bayes import MultinomialNB

Mn = MultinomialNB()
Mn.fit(x_train, y_train)
y_pred = Mn.predict(x_test)
print("Accuracy: ", Mn.score(y_pred.reshape(-1,1),y_test))

When I print the sizes of the variables:

print(y.size)
print(x.size)
print(x_train.size)
print(y_train.size)
print(x_test.size)
print("y test", y_test.size)
print("y pred", y_pred.size)

I get:

86460
432300000
389070000
77814
43230000
y test 8646
y pred 8646

However, the model fails with ValueError: shapes (8646,1) and (5000,2) not aligned: 1 (dim 1) != 5000 (dim 0).

As far as I understand, the problem is somewhere in the computation behind these methods, where some np.dot(a, b) call fails. It somehow multiplies y_pred or y_test (length 8646) with an array whose first dimension is max_features (5000). That is the only place where the value 5000 appears.

  • Can you print out shape instead of size? Also, at which line is the error occurring? – ranka47 Apr 13 '21 at 14:50
  • These are the shapes: y (86460,), x (86460, 5000), x_train (77814, 5000), y_train (77814,), x_test (8646, 5000), y_test (8646,), y_pred (8646,). Also, the error was in the last line, print("Accuracy: ", Mn.score(y_pred.reshape(-1,1),y_test)) – Seth Hexflame Apr 13 '21 at 20:33

1 Answer

If you refer to the documentation of MultinomialNB, you can see that the first argument to score is not y_pred but X, the feature matrix. Hence, the call to score should be:

print("Accuracy: ", Mn.score(x_test,y_test))

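score calls self.predict(x_test) internally and compares the result with y_test, so you do not need to pass y_pred yourself. In your call, y_pred.reshape(-1,1) was treated as a feature matrix with a single column, which cannot be multiplied with the model's (5000, 2) matrix of per-feature log probabilities; that is where the shape error comes from.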
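To check this on a small scale, here is a minimal sketch. It assumes synthetic random counts in place of the tweet matrix (200 rows and 50 features are arbitrary stand-ins, not your data), and the names model, score_from_method, score_from_preds and wrong_call_failed are only for illustration. It compares score(x_test, y_test) with the accuracy computed from explicit predictions, and then repeats the call from the question to show that it raises a ValueError (the exact message may differ between scikit-learn versions).

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in for the count matrix (assumption: not the real tweet data).
rng = np.random.default_rng(0)
x = rng.integers(0, 5, size=(200, 50))  # 200 documents, 50 count features
y = rng.integers(0, 2, size=200)        # two classes

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.1, random_state=0
)

model = MultinomialNB()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

# score() takes the feature matrix and runs predict() internally.
score_from_method = model.score(x_test, y_test)
# The same number, computed from the explicit predictions.
score_from_preds = accuracy_score(y_test, y_pred)
print("Accuracy:", score_from_method)

# Passing predictions as X gives a one-column matrix where the model
# expects 50 columns, so the call fails with a ValueError.
try:
    model.score(y_pred.reshape(-1, 1), y_test)
    wrong_call_failed = False
except ValueError as err:
    wrong_call_failed = True
    print("ValueError:", err)
```

If you already have y_pred and want to reuse it, accuracy_score(y_test, y_pred) from sklearn.metrics gives the same value without predicting a second time.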

Checking the documentation should always be the first step when debugging your code.

ranka47