
Should I build the vocabulary only from the training data or from all the data? Either choice affects the test data in some way. I mean:

  • If we build the vocab only from the training data, the model won't recognize many of the words in the validation and test data, since those words are missing from the vocabulary.

  • Would using a pre-trained word embedding help in this situation (i.e. the model learns the new word not from the training data but from the pre-trained embedding)?

  • If yes, would a randomly initialized word embedding have the same effect?

  • On the other hand, I've seen many examples where people build their vocab from the entire dataset, so the test and validation data contribute to the vocabulary alongside the training data. Wouldn't this be an obvious data leakage problem?

Melai11

1 Answer

  1. If you're talking about word embeddings, then you should have a special token for out-of-vocabulary words (you probably don't want to keep every unique word, but rather the top N most frequent ones). E.g. add a special token like [UNK] and replace every unknown word with it (see the first sketch after this list).

  2. If you have pre-trained word embeddings and a small training set, use them as the initialization for your embedding layer (see the second sketch below).

  3. Also, there's no reason to initialize embeddings for words that you won't optimize during training: a word that never appears in the training data receives no gradient updates, so its randomly initialized vector never improves. Random initialization alone therefore does not have the same effect as pre-trained embeddings.

  4. If you do build the vocabulary from all the data, the only information that may leak is word frequency, which is usually not a serious issue.
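To make point 1 concrete, here is a minimal sketch of building a top-N vocabulary from the training split only, mapping every out-of-vocabulary word to [UNK]. The function names (build_vocab, encode), the whitespace tokenizer, and the default top_n are illustrative assumptions, not part of the original answer:

```python
from collections import Counter

UNK = "[UNK]"

def build_vocab(train_texts, top_n=20000):
    # Count word frequencies over the *training* split only.
    counts = Counter(word for text in train_texts for word in text.split())
    # Keep the top-N most frequent words; everything else maps to [UNK].
    vocab = {UNK: 0}
    for word, _ in counts.most_common(top_n):
        vocab[word] = len(vocab)
    return vocab

def encode(text, vocab):
    # Any word unseen at training time (e.g. one that only occurs in the
    # validation or test split) falls back to the [UNK] index.
    return [vocab.get(word, vocab[UNK]) for word in text.split()]

vocab = build_vocab(["the cat sat", "the dog ran"])
print(encode("the cat flew", vocab))  # "flew" maps to the [UNK] index 0
```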
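And a sketch of point 2: initializing an embedding matrix from pre-trained vectors where available, falling back to small random values otherwise. The `pretrained` dict (word to vector, e.g. parsed from a GloVe file), the dimension, and the initialization scale are assumptions for illustration:

```python
import numpy as np

def init_embeddings(vocab, pretrained, dim=100, rng=None):
    # vocab: dict mapping word -> row index (as built above).
    # pretrained: dict mapping word -> np.ndarray of shape (dim,); assumed
    # loaded beforehand from e.g. a GloVe text file.
    rng = rng or np.random.default_rng(0)
    emb = rng.normal(scale=0.1, size=(len(vocab), dim)).astype("float32")
    hits = 0
    for word, idx in vocab.items():
        vec = pretrained.get(word)
        if vec is not None:
            emb[idx] = vec  # overwrite the random row with the pre-trained vector
            hits += 1
    return emb, hits
```

The resulting matrix can be used as the initial weights of an embedding layer; words covered by the pre-trained vectors start from a meaningful point, while the rest start random and are only useful if they actually occur in the training data.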

roman