I have a training data set for which I know the classification labels, and a test data set for which I do not have the labels.
Now I want to fit the vectorizer on the union of the training and test reviews so that no words are missed.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(encoding='utf-8', stop_words="english", analyzer='word')
# Stack the training and test frames, then keep only the review text
df_union = pd.concat([imdb_dataset_train, reviews_test])
df_union = df_union['review']
df_union.head()
# df_union is already a Series at this point, so it is passed directly
X = vectorizer.fit_transform(df_union)
X_train = ?
X_test = ?
How can I split the combined matrix X back into its training and test parts such that
X_train.shape[1] == X_test.shape[1]
?
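
One possible approach, sketched under assumptions: since pd.concat keeps the rows in the order they were passed, the first len(imdb_dataset_train) rows of X should correspond to the training reviews and the remaining rows to the test reviews. The two small data frames below are hypothetical stand-ins for imdb_dataset_train and reviews_test, and the name reviews_union is introduced only for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for imdb_dataset_train and reviews_test.
imdb_dataset_train = pd.DataFrame({
    "review": ["a great movie", "terrible plot and acting"],
    "label": [1, 0],
})
reviews_test = pd.DataFrame({
    "review": ["great acting", "boring movie", "surprisingly good plot"],
})

vectorizer = CountVectorizer(stop_words="english", analyzer="word")

# Stack only the review columns; the training rows come first.
reviews_union = pd.concat(
    [imdb_dataset_train["review"], reviews_test["review"]],
    ignore_index=True,
)
X = vectorizer.fit_transform(reviews_union)

# Split the sparse matrix back by row position.
n_train = len(imdb_dataset_train)
X_train = X[:n_train]
X_test = X[n_train:]

# Both slices come from the same matrix, so they share the same columns.
print(X_train.shape, X_test.shape)
```

Because both parts are row slices of the same matrix, their column counts match by construction. An alternative that also yields equal column counts is to call fit_transform on the training reviews only and transform on the test reviews, since transform reuses the fitted vocabulary; words that occur only in the test set are then ignored.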