I am trying to fit a MultinomialNB() classifier. I read a CSV into a dataframe (data), then tokenized and lemmatized the text in order to keep the most used words. The code for the model is:
from sklearn.feature_extraction.text import CountVectorizer
max_features = 5000
count_vectorizer = CountVectorizer(max_features=max_features, stop_words="english")
sparce_matrix = count_vectorizer.fit_transform(Tweet_list).toarray()
y = data.iloc[:,0].values
x = sparce_matrix
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.1)
from sklearn.naive_bayes import MultinomialNB
Mn = MultinomialNB()
Mn.fit(x_train, y_train)
y_pred = Mn.predict(x_test)
print("Accuracy: ", Mn.score(y_pred.reshape(-1,1),y_test))
When I print the sizes of the variables:
print(y.size)
print(x.size)
print(x_train.size)
print(y_train.size)
print(x_test.size)
print("y test", y_test.size)
print("y pred", y_pred.size)
I get:
86460
432300000
389070000
77814
43230000
y test 8646
y pred 8646
However, the model fails with ValueError: shapes (8646,1) and (5000,2) not aligned: 1 (dim 1) != 5000 (dim 0).
As far as I understand, the problem is an np.dot(a, b) call that fails somewhere inside the library methods. It seems to multiply y_pred or y_test (8646 elements) by an array whose first dimension equals max_features (5000). That is the only place where the value 5000 appears in my code.
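One possible reading of the shapes in the error, offered as a sketch rather than a confirmed diagnosis: scikit-learn's score(X, y) takes the feature matrix as its first argument and runs predict on it internally, so passing y_pred.reshape(-1,1) hands the model an array with 1 column where it was fitted on 5000. The toy example below reproduces that situation with made-up data (6 features instead of 5000, random counts, and the names acc_from_score, acc_from_pred and mismatch_error are illustrative only). The exact wording of the ValueError is an assumption that depends on the scikit-learn version, so only its type is checked.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Made-up count data standing in for the CountVectorizer output.
rng = np.random.default_rng(0)
x_train = rng.integers(0, 5, size=(40, 6))   # 40 samples, 6 features
y_train = np.tile([0, 1], 20)                # two classes, both present
x_test = rng.integers(0, 5, size=(10, 6))
y_test = np.tile([0, 1], 5)

Mn = MultinomialNB()
Mn.fit(x_train, y_train)
y_pred = Mn.predict(x_test)

# score(X, y) predicts from X itself, so it needs the test features.
acc_from_score = Mn.score(x_test, y_test)

# Equivalent check that reuses the predictions already computed.
acc_from_pred = accuracy_score(y_test, y_pred)

# Passing predictions as X gives shape (10, 1) against a model fitted
# on 6 features, which raises a ValueError about mismatched shapes.
try:
    Mn.score(y_pred.reshape(-1, 1), y_test)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)

print("Accuracy via score:", acc_from_score)
print("Accuracy via accuracy_score:", acc_from_pred)
print("Error when passing y_pred as X:", mismatch_error)
```

If this reading applies to the code above, either Mn.score(x_test, y_test) or accuracy_score(y_test, y_pred) would report the accuracy. Printing x_test.shape and y_pred.shape instead of .size would also show the row and column counts directly.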