I have preprocessed some data and want to train a Multinomial Naive Bayes classifier on it. The training set is 80% of my data and the test set is the remaining 20%.

The train data is an array of size 8452 and the test data is an array of size 4231.

If I want to see the predictions for the train data, the following code runs just fine:

multiNB = MultinomialNB()

model = multiNB.fit(x_train, y_train)

y_preds = model.predict(x_train)

But if I try to predict on my test data, i.e.

y_preds = model.predict(x_test)

I get the following error:

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0,
 with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 8452 is different from 4231)

If I need to provide more information about my code, please ask. I am stuck here and do not really understand what is causing this error, so any help is welcome.

This is how I obtained my train-test sets:

total_count = len(tokenised_reviews)

split = int(total_count * 0.8)

shuffle = np.random.permutation(total_count)

x = []
y = []

for i in range(total_count):
    x.append(x_data[shuffle[i]])
    y.append(y_data[shuffle[i]])

x_train = x[:split]
x_test = x[split:]

y_train = y[:split]
y_test = y[split:]
apol96

1 Answer

This is too long to type as a comment. I got a very weird structure when I tried your code again. I have no idea what x_data is, so it is hard to say what the exact error is.

I suspect something went wrong when putting the data back into a list, so if you do this:

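import numpy as np

# Assumes x_data and y_data are NumPy arrays (index arrays do not work on plain lists)
total_count = len(x_data)
split = int(total_count * 0.8)
shuffle = np.random.permutation(total_count)

# First 80% of the shuffled indices go to training, the remaining 20% to testing
x_train = x_data[shuffle[:split]]
x_test = x_data[shuffle[split:]]

y_train = y_data[shuffle[:split]]
y_test = y_data[shuffle[split:]]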

You should get your x_train and x_test as a subset of the original data.
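
As a quick sanity check, here is a minimal sketch on made-up data. The toy arrays and the `same_width` name are only for illustration, and it assumes the data is a 2-D NumPy array with one row per sample. Splitting by rows like this changes the number of samples but should leave the number of columns (features) identical, which is what `predict` needs:

```python
import numpy as np

# Hypothetical toy data: 10 samples with 4 count features each.
rng = np.random.default_rng(0)
x_data = rng.integers(0, 5, size=(10, 4))
y_data = rng.integers(0, 2, size=10)

split = int(len(x_data) * 0.8)
shuffle = rng.permutation(len(x_data))

# Select rows by shuffled index; the columns are left untouched.
x_train, x_test = x_data[shuffle[:split]], x_data[shuffle[split:]]
y_train, y_test = y_data[shuffle[:split]], y_data[shuffle[split:]]

# The row counts differ, but the feature counts must agree.
print(x_train.shape, x_test.shape)
same_width = x_train.shape[1] == x_test.shape[1]
```

If `same_width` is False on your real data, the two matrices were probably built separately rather than sliced from one matrix.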

Or you can simply do:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2)
StupidWolf
  • The data that I want to fit ends up as an np.array: `print(type(x_train_arr), x_train_arr.shape)` shows shape `(1105, 8550)` and `print(type(x_test_arr), x_test_arr.shape)` shows shape `(277, 4002)`. I have tried your suggestions, but I get the same error every time – apol96 Nov 17 '20 at 21:27
  • You can see that your shapes are different. I don't think that is possible if you did what I have above. What is `x_data`, and how is it different from `tokenised_reviews`? – StupidWolf Nov 17 '20 at 21:31
  • You have to tabulate the train and test data together, then split them – StupidWolf Nov 17 '20 at 21:47
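
To illustrate the last comment, here is a rough sketch of tabulating first and splitting afterwards. It assumes the features are word counts built with scikit-learn's `CountVectorizer`, which the question does not confirm, and the `reviews` and `labels` lists are made-up stand-ins for the real data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in for the reviews and labels in the question.
reviews = [
    "great film loved it", "terrible plot bad acting",
    "loved the acting", "bad film", "great plot",
    "terrible film hated it", "loved it great acting",
    "hated the plot", "great great film", "bad bad acting",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Build the count matrix once over all documents, so there is a single vocabulary.
vectoriser = CountVectorizer()
x_data = vectoriser.fit_transform(reviews)

# Split the rows only after vectorising; both parts keep the same columns.
x_train, x_test, y_train, y_test = train_test_split(
    x_data, labels, test_size=0.2, random_state=0
)

model = MultinomialNB().fit(x_train, y_train)
y_preds = model.predict(x_test)

# Both matrices share one vocabulary, so the feature counts agree.
print(x_train.shape[1] == x_test.shape[1])
```

Another option, if you would rather keep the test documents out of the vocabulary, is to call `fit_transform` on the training documents only and then `transform` (not `fit_transform`) on the test documents with the same vectoriser.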