X has 811 features, but LogisticRegression is expecting 22203 features as input

Question

I have used TfidfVectorizer to transform text data for my LogisticRegression model.

Before I use fit_transform on the features I took out part of the data (there are 5 sources of data and I take out one of them to see later how my model generalizes).
Then I split the features into train and test data

Then after the model was trained I used vectorizer.transform on the totally unseen data but I get the error above.

 vectorizer = TfidfVectorizer(ngram_range=(1,2))

 test_unseen_data = clean_df[clean_df['category']!='other']
 clean_df = clean_df[clean_df['category']=='other']
 X_features = vectorizer.fit_transform(clean_df['lemmatized_sentence'])
 y_labels = clean_df['type']

 # train

 unseen_features = vectorizer.transform(test_unseen_data['lemmatized_sentence'])
 unseen_labels = test_unseen_data['type']

 y_pred = best_log_regr.predict(unseen_features)
 y_decision_func = best_log_regr.predict_proba(unseen_features)

So the features on which I used fit_transform have the following characteristics:

<51762x22203 sparse matrix of type '<class 'numpy.float64'>'
    with 1016880 stored elements in Compressed Sparse Row format>

The ones that are totally unseen:

<745x811 sparse matrix of type '<class 'numpy.float64'>'
    with 12444 stored elements in Compressed Sparse Row format>

I have used fit_transform and transform before but I have never had such an issue before.

Looks like your vocabulary differs in size. Confirm that you do *not* instantiate a new Vectorizer prior to use it on the unseen data. — Ric Hard, Nov 01 '22 at 12:20

X has 811 features, but LogisticRegression is expecting 22203 features as input

0 Answers0

Linked