I am struggling with a dimension error when I try to predict with a Naive Bayes classifier.
The data consists of a column of sentences and a column of sentiments (aka labels). I want to use a Naive Bayes classifier to predict the sentiment of each sentence.
I start off by separating out the training, testing and validation data sets:
import pandas as pd
from sklearn.feature_extraction.text import (CountVectorizer, TfidfVectorizer, TfidfTransformer)
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
training_set, sentence_split_further, training_set_sentiments, sentiments_split_further = train_test_split(
    sentence_data.Sentence, sentence_data.Sentiment,
    test_size=.5, train_size=.5, random_state=1)
testing_set, validation_set, testing_set_sentiments, validation_set_sentiments = train_test_split(
    sentence_split_further, sentiments_split_further,
    test_size=.5, train_size=.5, random_state=1)
Then I create a feature matrix, apply TF-IDF, and prune to the best k words. I do all of this in a function I wrote called feature_selection_vector:
tfidf_training_feature_matrix=feature_selection_vector(training_set,training_set_sentiments)
tfidf_testing_feature_matrix=feature_selection_vector(testing_set,testing_set_sentiments)
tfidf_validation_feature_matrix=feature_selection_vector(validation_set,validation_set_sentiments)
Here is the code for the feature_selection_vector function:
def feature_selection_vector(sentence_data, sentiments):
    # creates the feature vector and calculates TF-IDF
    vectorizer = CountVectorizer(analyzer='word',
                                 token_pattern=r'\b[a-zA-Z]{3,}\b',
                                 ngram_range=(1, 1)
                                 )
    count_vectorized = vectorizer.fit_transform(sentence_data)
    tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
    vectorized = tfidf_transformer.fit_transform(count_vectorized)
    vector = pd.DataFrame(vectorized.toarray(),
                          index=['sentence ' + str(i)
                                 for i in range(1, 1 + len(sentence_data))],
                          columns=vectorizer.get_feature_names())
    selector = SelectKBest(chi2, k=1000)
    selector.fit(vector, sentiments)
    return vector
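One thing worth noting about the function above: it fits the selector but returns `vector` unchanged, so no columns are actually pruned. As a rough sketch of what applying the selection could look like (the toy sentences, the labels, and `k=3` are made up for illustration and are not from my data), `SelectKBest.transform` is the call that returns the reduced matrix:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy data, purely illustrative (not the real dataset)
sentences = [
    "the movie was great",
    "the plot was terrible",
    "great acting and great plot",
    "terrible movie and terrible acting",
]
labels = ["pos", "neg", "pos", "neg"]

# TF-IDF matrix with one column per vocabulary word
tfidf = TfidfVectorizer().fit_transform(sentences)

# fit() only scores the features; transform() keeps the k best columns
selector = SelectKBest(chi2, k=3)
selector.fit(tfidf, labels)
pruned = selector.transform(tfidf)

print(tfidf.shape, pruned.shape)
```

In this sketch `k` is kept below the number of columns; with a small vocabulary a value such as 1000 may exceed the number of available features.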
Now I want to fit the Naive Bayes classifier on the training data and then use the model to predict on the testing data:
naive_bayes = MultinomialNB()
naive_bayes.fit(tfidf_training_feature_matrix,training_set_sentiments)
NBC_tfidf_sentiment_predicted=naive_bayes.predict(tfidf_testing_feature_matrix)
However, I keep getting this error:
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 892 is different from 348)
The two sizes it is complaining about are the number of columns of the training set (892) and the number of columns of the testing set (348).
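To illustrate where two different column counts can come from, here is a minimal sketch with made-up sentences (the toy data and variable names are assumptions, not my real dataset). Calling `fit_transform` separately on each split builds a separate vocabulary per split, so the column counts differ. Fitting once on the training split and calling only `transform` on the testing split keeps the columns aligned, and `predict` then runs:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data, purely illustrative (not the real dataset)
train_sentences = [
    "the movie was great",
    "the plot was terrible",
    "great acting and great plot",
]
train_labels = ["pos", "neg", "pos"]
test_sentences = ["terrible acting", "great movie"]

# Separate fits: each split gets its own vocabulary, hence its own column count
separate_train = TfidfVectorizer().fit_transform(train_sentences)
separate_test = TfidfVectorizer().fit_transform(test_sentences)
print(separate_train.shape[1], separate_test.shape[1])

# Shared fit: learn the vocabulary on the training split only,
# then reuse it to transform the testing split
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_sentences)
X_test = vectorizer.transform(test_sentences)
print(X_train.shape[1], X_test.shape[1])

# With matching columns, fitting and predicting go through
model = MultinomialNB().fit(X_train, train_labels)
predictions = model.predict(X_test)
print(predictions)
```

The same idea would presumably apply to the selector: fit it on the training matrix and reuse it to transform the testing and validation matrices.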