ValueError: operands could not be broadcast together with shapes in Naive bayes classifier

Question

Getting straight to the point:

1) My goal is to apply NLP and a machine learning algorithm to classify a dataset of sentences into 5 numeric categories. For example: "I want to know details of my order" -> 1.

Code:

import numpy as np
import pandas as pd

dataset = pd.read_csv('Ecom.tsv', delimiter = '\t', quoting = 3)

import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

corpus = []
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the stopword set once, not per word
for i in range(len(dataset)):
    # Keep letters only, lowercase, and split into tokens
    review = re.sub('[^a-zA-Z]', ' ', dataset['User'][i])
    review = review.lower().split()
    # Drop stopwords and stem the remaining tokens
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))

# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

Everything works fine here: the model trains and predicts correct results for the test data.

2) Now I want to use this trained model to predict a category for a new sentence, so I pre-processed the text the same way I did for my dataset.

Code:

#Pre processing the new input
new_text = "Please tell me the details of this order"
new_text = new_text.split()
ps = PorterStemmer()
processed_text = [ps.stem(word) for word in new_text if not word in set(stopwords.words('english'))]

vect = CountVectorizer()
Z = vect.fit_transform(processed_text).toarray()
classifier.predict(Z)

ValueError: operands could not be broadcast together with shapes (4,4) (33,)

The only thing I understand is that when I transformed my corpus to train the model, the resulting NumPy array had shape (18, 33). When I transformed processed_text with fit_transform() to predict for the new input, the array had shape (4, 4).

I cannot figure out whether I applied some step incorrectly. What is the resolution? Thanks in advance! :)

yes you got the problem right! You will have to save the transform object you used at training time and then apply it at test time (only `transform()`). This will allow you to end up having the same size. [Here](https://stackoverflow.com/questions/24152282/saving-a-feature-vector-for-new-data-in-scikit-learn) is pretty much the same question answered in a few different ways — lorenzori, Jan 08 '18 at 16:39
@lorenzori Thanks for answering. However i am still not able to understand. Can you please elaborate your solution a bit? — Shikhar Thapliyal, Jan 08 '18 at 16:44
say you have a corpus made of 33 different words, then your bag of words at training time will have 33 columns. Now you are using another corpus which has only 4 different words. You end up with a matrix with 4 columns, and the model won't like that! hence you need to fit the second corpus in the same bag of words matrix you had at the beginning, with 33 columns. There are different ways to do this, well explained in the link above! — lorenzori, Jan 08 '18 at 16:48

score 5 · Accepted Answer · answered Jan 09 '18 at 07:38

You got the problem right!

Say you have a corpus made of 33 different words; then your bag of words at training time will have 33 columns. Now you are using another corpus which has only 4 different words. You end up with a matrix with 4 columns, and the model won't like that! Hence you need to map the second corpus onto the same bag-of-words columns you had at the beginning, all 33 of them. There are different ways to do this, well explained [here](https://stackoverflow.com/questions/24152282/saving-a-feature-vector-for-new-data-in-scikit-learn).
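To make the mismatch concrete, here is a minimal sketch. The tiny `train_corpus` and `new_tokens` below are made up for illustration (they are not the asker's Ecom.tsv data), so the column counts differ from the 33 and 4 in the question, but the effect is the same: a second, freshly fitted CountVectorizer builds its own vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, already-stemmed training sentences (stand-ins for the real corpus)
train_corpus = [
    "want know detail order",
    "cancel order refund",
    "track deliveri statu",
]
# Hypothetical tokens of the new sentence, passed as a list like in the question
new_tokens = ["pleas", "tell", "detail", "order"]

# Vectorizer fitted on the training corpus: one column per distinct training word
train_cv = CountVectorizer()
X_train = train_cv.fit_transform(train_corpus).toarray()

# A second vectorizer fitted on the new text learns a different, smaller vocabulary.
# Each list element is also treated as a separate document, hence one row per token.
new_cv = CountVectorizer()
Z = new_cv.fit_transform(new_tokens).toarray()

print(X_train.shape)  # columns = distinct words in the training corpus
print(Z.shape)        # columns = distinct words in the new text only
```

A classifier fitted on `X_train` expects as many columns as `X_train` has, so passing it `Z` fails with a shape error like the one in the question.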

For example, one way is to keep the vectorizer object you fitted at training time (with fit() or fit_transform()) and apply only transform() to new text at prediction time!
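A minimal sketch of that approach, assuming a small made-up corpus and labels in place of the asker's dataset (`train_corpus`, `y_train`, and `new_doc` are hypothetical). The same fitted `cv` is reused for the new sentence, and the sentence is passed as a single string inside a list so it becomes one row:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

# Hypothetical, already-stemmed training sentences and their numeric categories
train_corpus = [
    "want know detail order",
    "cancel order refund",
    "track deliveri statu",
]
y_train = np.array([1, 2, 3])

# Fit the vectorizer once, on the training corpus only
cv = CountVectorizer()
X_train = cv.fit_transform(train_corpus).toarray()

classifier = GaussianNB()
classifier.fit(X_train, y_train)

# New sentence, pre-processed the same way and joined back into ONE string
new_doc = "pleas tell detail order"

# transform() only: reuse the training vocabulary, do not refit
Z = cv.transform([new_doc]).toarray()
prediction = classifier.predict(Z)

print(Z.shape)     # one row, same number of columns as X_train
print(prediction)  # one predicted category for the new sentence
```

Words that never appeared in the training corpus (here "pleas" and "tell") are simply ignored by transform(). If prediction happens in a separate process, the fitted vectorizer and classifier could be persisted together, for instance with Python's pickle module.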
