0

I have used TfidfVectorizer to transform text data for my LogisticRegression model.

  1. Before I use fit_transform on the features I took out part of the data (there are 5 sources of data and I take out one of them to see later how my model generalizes).

  2. Then I split the features into train and test data

  3. Then after the model was trained I used vectorizer.transform on the totally unseen data but I get the error above.

     vectorizer = TfidfVectorizer(ngram_range=(1,2))
    
     test_unseen_data = clean_df[clean_df['category']!='other']
     clean_df = clean_df[clean_df['category']=='other']
     X_features = vectorizer.fit_transform(clean_df['lemmatized_sentence'])
     y_labels = clean_df['type']
    
     # train
    
     unseen_features = vectorizer.transform(test_unseen_data['lemmatized_sentence'])
     unseen_labels = test_unseen_data['type']
    
     y_pred = best_log_regr.predict(unseen_features)
     y_decision_func = best_log_regr.predict_proba(unseen_features)
    

So the features on which I used fit_transform have the following characteristics:

<51762x22203 sparse matrix of type '<class 'numpy.float64'>'
    with 1016880 stored elements in Compressed Sparse Row format>

The ones that are totally unseen:

<745x811 sparse matrix of type '<class 'numpy.float64'>'
    with 12444 stored elements in Compressed Sparse Row format>

I have used fit_transform and transform before but I have never had such an issue before.

Yana
  • 785
  • 8
  • 23

0 Answers0