Words that weren't present during training mean nothing to Doc2Vec, so, quite commonly, they're simply ignored when encountered in later texts.
It would only make sense to add new words to a model if you could also do more training, including those new words, to somehow integrate them with the existing model.
But while such continued incremental training is theoretically possible, it requires many murky choices: how much training to do, at what alpha learning rates, and to what extent older examples should also be re-trained to keep the model consistent. There's little published work suggesting workable rules of thumb, and doing it blindly could just as easily worsen the model's performance as improve it.
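As a concrete illustration of the first point, here is a minimal sketch (the toy corpus and recent-gensim parameter names like vector_size/epochs are my own choices, not anything from a real project) showing that inference simply skips tokens that weren't in the training vocabulary:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tiny toy corpus; a real model needs many more, longer documents.
corpus = [
    TaggedDocument(words=['machine', 'learning', 'with', 'text'], tags=['doc0']),
    TaggedDocument(words=['vector', 'models', 'of', 'text'], tags=['doc1']),
]
model = Doc2Vec(corpus, vector_size=20, min_count=1, epochs=40)

# 'blockchain' never appeared in training, so it is silently skipped;
# the inferred vector is driven only by the known words.
vec = model.infer_vector(['machine', 'learning', 'blockchain'])
print(vec.shape)  # (20,)
```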
(Also, while Doc2Vec's parent class, Word2Vec, offers an experimental update=True option on its build_vocab() step for later vocabulary expansion, it wasn't designed or tested with Doc2Vec in mind, and there's an open issue where trying to use it causes memory-fault crashes: https://github.com/RaRe-Technologies/gensim/issues/1019.)
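Purely for illustration, the experimental expansion path on plain Word2Vec looks roughly like the sketch below; the toy sentences and epoch choices are my own assumptions, and per the issue above this is not something to rely on for Doc2Vec:

```python
from gensim.models import Word2Vec

sentences = [['human', 'interface', 'computer'],
             ['graph', 'trees', 'survey']]
model = Word2Vec(sentences, vector_size=20, min_count=1, epochs=20)

# Experimental vocabulary expansion: add new words, then keep training.
# How many extra epochs, and at what alpha, is exactly the murky part.
new_sentences = [['graph', 'minors', 'survey'],
                 ['minors', 'trees', 'paths']]
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=len(new_sentences),
            epochs=model.epochs)
```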
Note that since Doc2Vec is an unsupervised method for creating features from text, if your ultimate task is to use Doc2Vec features for classification, it can sometimes be sensible to include your 'test' texts (without class labels) in the Doc2Vec training set, so that it learns their words and their (unsupervised) relations to other words. The separate supervised classifier would then be trained only on non-test items and their known labels.
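A rough sketch of that split follows, using made-up toy data, a scikit-learn LogisticRegression as the separate classifier, and a gensim-4.x-style model.dv lookup; all of these specifics are my own assumptions:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: only the training texts have known labels.
train_texts = [['good', 'fun', 'movie'], ['terrible', 'boring', 'plot']]
train_labels = [1, 0]
test_texts = [['fun', 'plot'], ['boring', 'movie']]

# Unsupervised step: Doc2Vec trains on *all* texts; labels are never involved.
all_docs = [TaggedDocument(words=words, tags=[i])
            for i, words in enumerate(train_texts + test_texts)]
model = Doc2Vec(all_docs, vector_size=20, min_count=1, epochs=40)

# Supervised step: the classifier only ever sees the labeled training items.
X_train = [model.dv[i] for i in range(len(train_texts))]
clf = LogisticRegression().fit(X_train, train_labels)

# Test vectors were already learned (unsupervised) during Doc2Vec training.
X_test = [model.dv[len(train_texts) + i] for i in range(len(test_texts))]
print(clf.predict(X_test))
```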