I am trying to build a bag-of-words (BoW) representation of my text dataset before applying ML. I do not want my training set to influence my test set (i.e., data leakage), so I want to build the BoW on the train set separately from the test set. But then my test set has different features (i.e., words) than my train set, so the two matrices are not the same size. I tried keeping only the columns in the test set that also appear in the train set, but 1) I cannot get my code to work correctly and 2) I do not think this is the most efficient procedure. I think I also need code to add filler columns for words that appear in the train set but not in the test set? Here is what I have:
import nltk
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def bow(tokens, data):
    # Split each document into word tokens
    tokens = tokens.apply(nltk.word_tokenize)
    # The identity tokenizer passes the pre-tokenized lists straight through
    cvec = CountVectorizer(min_df=.01, max_df=.99, ngram_range=(1, 2), tokenizer=lambda doc: doc, lowercase=False)
    cvec.fit(tokens)
    cvec_counts = cvec.transform(tokens)
    cvec_counts_bow = cvec_counts.toarray()
    # get_feature_names() was removed in scikit-learn 1.2
    vocab = cvec.get_feature_names_out()
    bow_model = pd.DataFrame(cvec_counts_bow, columns=vocab)
    return bow_model

X_train = bow(train['text'], train)
X_test = bow(test['text'], test)
vocab = list(X_train.columns)
# Keep only the test columns that also appear in the train vocabulary
X_test = X_test[[w for w in X_test.columns if w in vocab]]