I have used TfidfVectorizer to transform text data for my LogisticRegression model.
Before calling fit_transform on the features, I set aside part of the data (there are 5 sources of data and I hold one of them out to see later how well my model generalizes). Then I split the remaining features into train and test data.
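The split itself is not shown in the snippet further down; below is only a minimal sketch of that step, assuming scikit-learn's train_test_split and reusing the X_features / y_labels names from that snippet (the parameters are placeholders, not my exact call):

from sklearn.model_selection import train_test_split

# illustrative sketch of the train/test split step (placeholder parameters),
# using the X_features / y_labels names defined in the code below
X_train, X_test, y_train, y_test = train_test_split(
    X_features, y_labels, test_size=0.2, random_state=42
)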
After the model was trained, I used vectorizer.transform on the totally unseen data, but I get the error above.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1,2))

# keep the rows with category 'other' for fitting; the rest is the unseen hold-out
test_unseen_data = clean_df[clean_df['category']!='other']
clean_df = clean_df[clean_df['category']=='other']

# fit the vectorizer on the remaining data only
X_features = vectorizer.fit_transform(clean_df['lemmatized_sentence'])
y_labels = clean_df['type']

# train

# transform the held-out rows with the already-fitted vectorizer
unseen_features = vectorizer.transform(test_unseen_data['lemmatized_sentence'])
unseen_labels = test_unseen_data['type']

y_pred = best_log_regr.predict(unseen_features)
y_decision_func = best_log_regr.predict_proba(unseen_features)
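The characteristics listed next come from simply inspecting the two sparse matrices, along these lines:

# print the repr of both matrices to compare their shapes
print(repr(X_features))       # features the vectorizer was fitted on
print(repr(unseen_features))  # transformed hold-out features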
So the features on which I used fit_transform have the following characteristics:
<51762x22203 sparse matrix of type '<class 'numpy.float64'>'
with 1016880 stored elements in Compressed Sparse Row format>
The ones that are totally unseen:
<745x811 sparse matrix of type '<class 'numpy.float64'>'
with 12444 stored elements in Compressed Sparse Row format>
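For comparison, this is the behavior I expected from a single fitted vectorizer: transform reuses the vocabulary learned by fit_transform, so the number of columns should match. A minimal self-contained sketch with toy sentences (not my data):

from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["the cat sat on the mat", "the dog barked", "a cat and a dog"]
new_texts = ["a completely unseen cat sentence"]

vec = TfidfVectorizer(ngram_range=(1, 2))
X_train = vec.fit_transform(train_texts)  # learns the (1,2)-gram vocabulary
X_new = vec.transform(new_texts)          # reuses that same vocabulary

# both matrices have the same number of columns: the vocabulary size
print(X_train.shape, X_new.shape, len(vec.vocabulary_))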
I have used fit_transform and transform like this before, but I have never run into this issue.