
I have a training data set for which I know the labels for the classification, and a test data set where I do not have the labels.

Now, I want to fit the vectorizer to the union of the training and test reviews so that no words are missed.

df with labels

df without labels

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(encoding='utf-8', stop_words='english', analyzer='word')

df_union = pd.concat([imdb_dataset_train, reviews_test])
df_union = df_union['review']
df_union.head()

X = vectorizer.fit_transform(df_union)

X_train = ?
X_test = ?

How can I split X back into train and test parts such that X_train.shape[1] == X_test.shape[1]?

ghxk

1 Answer


There is a lot of confusion in your question:

  • You will need some labelled test set in order to evaluate the model. If you train a model and apply it directly to an unlabelled test set, then you don't know if the model actually works. It's like taking some random medicine without knowing if it's suitable for your problem. An alternative to a labelled test set is to use k-fold cross-validation on the training set.
  • CountVectorizer is only a representation (encoding) of the text, you need a classification algorithm in order to train a model on it (for example a decision tree).
  • The model cannot and should not use anything from the test set:
    • First because this is data leakage, which means the evaluation would be wrong.
    • Second because it doesn't make sense: what can the model learn from the training set about words which don't appear in the training set? Nothing of course, so it would be totally pointless to have words which don't appear in the training set as features.
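The points above translate into a simple pattern: fit the vectorizer on the training reviews only, then transform the test reviews with the same fitted vectorizer. The feature dimensions match automatically, because `transform` reuses the vocabulary learned during `fit` and silently ignores unseen words. A minimal sketch with toy data (the review strings are illustrative, not from the question's data sets):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_reviews = ["a great movie", "a terrible movie", "great acting"]
test_reviews = ["an unseen terrible film"]  # contains words absent from training

vectorizer = CountVectorizer(stop_words="english", analyzer="word")

# fit_transform learns the vocabulary from the training set only
X_train = vectorizer.fit_transform(train_reviews)

# transform reuses that vocabulary; words like "unseen" are simply dropped
X_test = vectorizer.transform(test_reviews)

print(X_train.shape[1] == X_test.shape[1])  # True: same feature space
```

No merging or re-splitting is needed: because both matrices come from the same fitted vectorizer, they always share the same columns.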

Keep in mind that the model tries to capture the statistical patterns found in the training set. The goal is not to have a vocabulary as complete as possible, it's to make the model able to predict new instances as accurately as possible. In fact it's very often much better for performance to even ignore the least frequent words in the training set, because they cause statistical noise.

Erwan