Python - Predicting test data that is smaller than train data

Question

I have preprocessed some data and I am ready to train a Multinomial Naive Bayes classifier. The train data is 80% of my data and the test data is 20%.

The train data is an array of size 8452 and the test data is an array of size 4231.

If I want to see the predictions for the train data, the following code runs just fine:

multiNB = MultinomialNB()

model = multiNB.fit(x_train, y_train)

y_preds = model.predict(x_train)

But if I want to predict on my test data, i.e.

y_preds = model.predict(x_test)

I get the following error:

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0,
 with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 8452 is different from 4231)

If I need to provide more information about my code please ask, but I am stuck here and I do not really understand what is causing that error, and any help is welcomed.

This is how I obtained my train-test sets:

total_count = len(tokenised_reviews)

split = int(total_count * 0.8)

shuffle = np.random.permutation(total_count)

x = []
y = []

for i in range(total_count):
    x.append(x_data[shuffle[i]])
    y.append(y_data[shuffle[i]])

x_train = x[:split]
x_test = x[split:]

y_train = y[:split]
y_test = y[split:]

I cannot reproduce your error. Did you split train and test correctly? Can you share how you obtained the train and test sets? – StupidWolf, Nov 17 '20 at 16:21
If `x_data` is an array, skip the loop and do `x = x_data[shuffle]`. Same for `y`. Might help. Do you need to transpose the arrays or something? – Mad Physicist, Nov 17 '20 at 16:53
I need to append the data so that I can index through x and y. – apol96, Nov 17 '20 at 17:40
You can just pull the data out using the index as suggested, or you can use `train_test_split` from scikit-learn. The reason I asked about the splitting is that I cannot reproduce your error using an example dataset. – StupidWolf, Nov 17 '20 at 17:42

score 0 · Answer 1 · answered Nov 17 '20 at 18:11

Too long to type as a comment. I got a very weird structure when I tried your code again. I have no idea what x_data is, so it is hard to explain the exact error.

I suspect something went wrong when putting the data back into a list, so try indexing the arrays directly instead:

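import numpy as np

# x_data and y_data must be NumPy arrays for index-array selection to work
total_count = len(x_data)
split = int(total_count * 0.8)
shuffle = np.random.permutation(total_count)

# first 80% of the shuffled indices for training, the remaining 20% for testing
x_train = x_data[shuffle[:split]]
x_test = x_data[shuffle[split:]]

y_train = y_data[shuffle[:split]]
y_test = y_data[shuffle[split:]]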

You should get your x_train and x_test as row subsets of the original data, so both keep the same number of columns (features).

Or you can simply do:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2)
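
As a quick sanity check, here is a minimal sketch of that call on made-up data. The arrays are hypothetical stand-ins, since I don't know what your x_data looks like. The point is only that the split works on rows, so the train and test arrays end up with the same number of columns:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: 10 samples, 6 count features.
rng = np.random.default_rng(0)
x_data = rng.integers(0, 5, size=(10, 6))
y_data = rng.integers(0, 2, size=10)

# The split selects rows only; the feature (column) dimension is untouched.
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.2, random_state=0
)

print(x_train.shape, x_test.shape)
```

If your own x_train and x_test disagree in the second dimension after a split like this, the mismatch is being introduced somewhere else, most likely in a later preprocessing step.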

StupidWolf

The data that I want to fit ends up as np.array objects: `print(type(x_train_arr), x_train_arr.shape)` shows shape `(1105, 8550)` and `print(type(x_test_arr), x_test_arr.shape)` shows shape `(277, 4002)`. I have tried your suggestions but I get the same error every time. – apol96 Nov 17 '20 at 21:27
You can see that your shapes are different: the number of columns does not match. I don't think that is possible if you did what I have above. What is `x_data` and how is it different from `tokenised_reviews`? – StupidWolf Nov 17 '20 at 21:31
You have to tabulate the train and test data together, then split them. – StupidWolf Nov 17 '20 at 21:47
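
The shapes in the comments (8550 versus 4002 columns) suggest that the train and test texts were turned into count features separately, so each set got its own vocabulary. The question does not show that step, so what follows is only a sketch under the assumption that something like scikit-learn's `CountVectorizer` is used. The toy documents and labels are made up. It shows a common variant of "tabulate together": learn the vocabulary once on the train texts, then reuse it for the test texts so both matrices have the same columns.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy reviews standing in for the real tokenised data.
train_docs = ["good film", "bad film", "great acting", "terrible plot"]
y_train = [1, 0, 1, 0]
test_docs = ["good acting", "bad plot twist"]

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary from the train texts only.
x_train = vectorizer.fit_transform(train_docs)
# transform reuses that vocabulary; words not seen in training are ignored.
x_test = vectorizer.transform(test_docs)

model = MultinomialNB().fit(x_train, y_train)
y_preds = model.predict(x_test)

print(x_train.shape, x_test.shape)
```

Calling `fit_transform` a second time on the test texts would build a new vocabulary and reproduce the column mismatch. Vectorising all texts in one call and splitting the resulting matrix afterwards also keeps the columns aligned, which is the approach described in the comment above.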
