0

I Already have a Doc2Vec model. I have trained it with my train data.

Now after a while I want to use Doc2Vec for my test data. I want to add my test data vocabulary to my existing model's vocabulary. How can I do this? I mean how can I update my vocabulary?

Here is my model:

    model = model.load('my_model.Doc2vec')

1 Answers1

0

Words that weren't present for training mean nothing to Doc2Vec, so quite commonly, they're just ignored when encountered in later texts.

It would only make sense to add new words to a model if you could also do more training, including those new words, to somehow integrate them with the existing model.

But, while such continued incremental training is theoretically possible, it also requires a lot of murky choices of how much training should be done, at what alpha learning rates, and to what extent older examples should also be retrained to maintain model consistency. There's little published work suggesting working rules-of-thumb, and doing it blindly could just as likely worsen the model's performance as improve it.

(Also, while the parent class for Doc2Vec, Word2Vec, offers an experimental update=True option on its build_vocab() step for later vocabulary-expansion, it wasn't designed or tested with Doc2Vec in mind, and there's an open issue where trying to use it causes memory-fault crashes: https://github.com/RaRe-Technologies/gensim/issues/1019.)

Note that since Doc2Vec is an unsupervised method for creating features from text, if your ultimate task is using Doc2Vec features for classification, it can sometimes be sensible to include your 'test' texts (without class labeling) in the Doc2Vec training set, so that it learns their words and the (unsupervised) relations to other words. The separate supervised classifier would then only be trained on non-test items, and their known labels.

gojomo
  • 52,260
  • 14
  • 86
  • 115