
I'm doing a University of Washington assignment where I have to predict the scores for sample_test_matrix (last few lines of the code) using decision_function() in LogisticRegression. But the error I'm getting is

    ValueError: X has 145 features per sample; expecting 113092

Here is the code:

   import pandas as pd 
   import numpy as np 
   from sklearn.linear_model import LogisticRegression

   products = pd.read_csv('amazon_baby.csv')

   def remove_punct (text) :
       import string 
       text = str(text)
       for i in string.punctuation:
          text = text.replace(i,"")
       return(text)

   products['review_clean'] = products['review'].apply(remove_punct)
   products = products[products.rating != 3]
   products['sentiment'] = products['rating'].apply(lambda x : +1 if x > 3 else  -1 )

   train_data_index = pd.read_json('module-2-assignment-train-idx.json')
   test_data_index = pd.read_json('module-2-assignment-test-idx.json')

   train_data = products.loc[train_data_index[0], :]
   test_data = products.loc[test_data_index[0], :]
   train_data = train_data.dropna()
   test_data = test_data.dropna()

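   from sklearn.feature_extraction.text import CountVectorizer

   # The post never defines `vectorizer`; default settings are assumed here.
   vectorizer = CountVectorizer()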

   train_matrix = vectorizer.fit_transform(train_data['review_clean'])
   test_matrix = vectorizer.fit_transform(test_data['review_clean'])

   sentiment_model = LogisticRegression()
   sentiment_model.fit(train_matrix, train_data['sentiment'])
   print (sentiment_model.coef_)

   sample_data = test_data[10:13]
   print (sample_data)

   sample_test_matrix = vectorizer.transform(sample_data['review_clean'])
   scores = sentiment_model.decision_function(sample_test_matrix)
   print (scores)

Here is the products data:

        Name                                                Review                                              Rating
    0   Planetwise Flannel Wipes                            These flannel wipes are OK, but in my opinion ...   3
    1   Planetwise Wipe Pouch                               it came early and was not disappointed. i love...   5
    2   Annas Dream Full Quilt with 2 Shams                 Very soft and comfortable and warmer than it l...   5
    3   Stop Pacifier Sucking without tears with Thumb...   This is a product well worth the purchase.  I ...   5
    4   Stop Pacifier Sucking without tears with Thumb...   All of my kids have cried non-stop when I trie...   5

harshi

1 Answer


This line causes the error in the subsequent lines:

    test_matrix = vectorizer.fit_transform(test_data['review_clean'])

Change it to:

    test_matrix = vectorizer.transform(test_data['review_clean'])

Explanation: Calling fit_transform() refits the CountVectorizer on the test data, so everything learned from the training data is lost and the vocabulary is rebuilt from the test data alone.

You then use that refitted vectorizer to transform sample_data['review_clean'], so sample_test_matrix contains only the features learned from test_data.

But sentiment_model was trained on the vocabulary from train_data. The two feature sets differ, which is why decision_function() receives a matrix whose width does not match the 113092 features the model expects.
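To see the effect in isolation, here is a minimal sketch on a made-up toy corpus (the documents and variable names below are hypothetical, not the Amazon data). It assumes CountVectorizer's default settings, which lowercase the text and drop single-character tokens:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpora, invented for illustration only.
train_docs = ["great soft wipes", "love this quilt", "not worth the money"]
test_docs = ["great quilt", "broke after a week"]

vectorizer = CountVectorizer()

# First fit: the vocabulary comes from the training documents.
train_matrix = vectorizer.fit_transform(train_docs)
vocab_after_train = set(vectorizer.vocabulary_)

# Second fit: the vocabulary is thrown away and rebuilt from the test documents.
test_matrix = vectorizer.fit_transform(test_docs)
vocab_after_refit = set(vectorizer.vocabulary_)

print(train_matrix.shape)  # (3, 10)
print(test_matrix.shape)   # (2, 5)
print("wipes" in vocab_after_train, "wipes" in vocab_after_refit)  # True False
```

The column counts differ (10 vs. 5), so a model fitted on train_matrix cannot score anything produced by the refitted vectorizer.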

Always call transform() on test data, never fit_transform(). Fit the vectorizer once, on the training data only.
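As a rough end-to-end sketch of that pattern (again with invented toy reviews and labels, and default settings for both CountVectorizer and LogisticRegression, which may not match what the assignment asks for):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration only.
train_docs = [
    "love it great product",
    "great soft and comfortable",
    "terrible cheap broke quickly",
    "awful waste of money",
]
train_labels = [1, 1, -1, -1]
test_docs = ["great product", "terrible waste of money", "unseen words only"]

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)  # learn the vocabulary here, once
test_matrix = vectorizer.transform(test_docs)        # reuse it; words not seen in training are ignored

model = LogisticRegression()
model.fit(train_matrix, train_labels)

# Both matrices have the same number of columns, so scoring works.
scores = model.decision_function(test_matrix)
print(train_matrix.shape[1] == test_matrix.shape[1])
print(scores)
```

The third test document contains no words from the training vocabulary, so its row is all zeros and its score is just the model's intercept.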

Vivek Kumar
  • @harshi I have added the explanation. Please go through it and ask if anything is still unclear. Also, if this helped, consider upvoting/accepting the answer. – Vivek Kumar Nov 09 '17 at 15:18