
I am struggling with a dimension error when I try to predict using a naive Bayes classifier.

The data consists of a column of sentences and a column of sentiments (aka labels). I want to use a naive Bayes classifier to predict the sentiment of each sentence.

I start off by separating out the training, testing, and validation data sets:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import (CountVectorizer, TfidfVectorizer, TfidfTransformer)
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_selection import SelectKBest, chi2

training_set,sentence_split_further,training_set_sentiments,sentiments_split_further=train_test_split(sentence_data.Sentence,sentence_data.Sentiment,test_size=.5, train_size=.5, random_state=1)

testing_set,validation_set,testing_set_sentiments,validation_set_sentiments=train_test_split(sentence_split_further,sentiments_split_further,test_size=.5, train_size=.5, random_state=1)

Then I create a feature matrix, apply tf-idf, and prune the best k words. I do all of this in a function I created called feature_selection_vector:

tfidf_training_feature_matrix=feature_selection_vector(training_set,training_set_sentiments)
tfidf_testing_feature_matrix=feature_selection_vector(testing_set,testing_set_sentiments)
tfidf_validation_feature_matrix=feature_selection_vector(validation_set,validation_set_sentiments)

Here is the code for the feature_selection_vector function

def feature_selection_vector( sentence_data, sentiments ):
    #creates the feature vector and calculates tf-idf
    vectorizer = CountVectorizer(analyzer='word', 
                                  token_pattern=r'\b[a-zA-Z]{3,}\b',  
                                  ngram_range=(1, 1) 
                                  )  
    count_vectorized = vectorizer.fit_transform(sentence_data)
    tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
    vectorized = tfidf_transformer.fit_transform(count_vectorized)
    
    vector=pd.DataFrame(vectorized.toarray(), 
                 index=['sentence '+str(i) 
                        for i in range(1, 1+len(sentence_data))],
                 columns=vectorizer.get_feature_names())
    selector = SelectKBest(chi2, k=1000)
    selector.fit(vector, sentiments)
    return vector

Now I want to fit the naive Bayes classifier on the training data and then use the model to predict on the testing data.

naive_bayes = MultinomialNB()
naive_bayes.fit(tfidf_training_feature_matrix,training_set_sentiments)
NBC_tfidf_sentiment_predicted=naive_bayes.predict(tfidf_testing_feature_matrix)

However, I keep getting this error:

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 892 is different from 348)

The two sizes it is complaining about are the number of columns in the training set (892) and the number of columns in the testing set (348).


1 Answer


You cannot use fit_transform to get features for the validation and test sets, as you do here (using your feature_selection_vector() function).

fit_transform is used only once, with the training data; for the validation and test sets, a plain transform should be used instead, on the same CountVectorizer and TfidfTransformer that have already been fitted to the training data.

In your code, both the CountVectorizer and the TfidfTransformer are fitted again on the validation and test data, which produces vocabularies of different sizes (892 features for training vs. 348 for testing) and hence the dimension-mismatch error you report.
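Roughly, the pattern looks like this (a sketch reusing the objects and variable names from your question; the DataFrame/SelectKBest part of your function is left out for brevity, but the same fit-on-training, transform-on-the-rest rule applies to SelectKBest as well):

vectorizer = CountVectorizer(analyzer='word',
                             token_pattern=r'\b[a-zA-Z]{3,}\b',
                             ngram_range=(1, 1))
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)

# fit_transform only on the training data (learns the vocabulary and the idf weights)
tfidf_training_feature_matrix = tfidf_transformer.fit_transform(vectorizer.fit_transform(training_set))

# plain transform for validation and test data (reuses the fitted vocabulary and idf weights)
tfidf_validation_feature_matrix = tfidf_transformer.transform(vectorizer.transform(validation_set))
tfidf_testing_feature_matrix = tfidf_transformer.transform(vectorizer.transform(testing_set))

# all three matrices now have the same number of columns, so predict() no longer complains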

For more details, see What is the difference between fit_transform and transform in sklearn countvectorizer?

You should seriously consider wrapping all the stages in a Pipeline.
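For example, a minimal sketch along these lines (step parameters copied from your question; you may need to lower k=1000 if it exceeds the vocabulary size):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

text_clf = Pipeline([
    ('vect', CountVectorizer(analyzer='word',
                             token_pattern=r'\b[a-zA-Z]{3,}\b',
                             ngram_range=(1, 1))),
    ('tfidf', TfidfTransformer(smooth_idf=True, use_idf=True)),
    ('select', SelectKBest(chi2, k=1000)),
    ('clf', MultinomialNB()),
])

# fit() runs fit_transform on every step with the training data only;
# predict() only transforms, so train and test features always line up
text_clf.fit(training_set, training_set_sentiments)
NBC_tfidf_sentiment_predicted = text_clf.predict(testing_set)

This way the fit/transform bookkeeping is handled for you, and the whole pipeline can be cross-validated or grid-searched as a single estimator.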
