I have a gensim Doc2Vec model trained on around 1000 documents. Now I need to incrementally update this existing model by adding 100 newly tagged documents, but I am not able to retrain the model incrementally. Can anyone help me with this?
Sorry, this is a duplicate. See here: [Updating training documents for gensim Doc2Vec model](https://stackoverflow.com/questions/47775557/updating-training-documents-for-gensim-doc2vec-model) – polm23 Jul 28 '21 at 05:33
1 Answer
Gensim's `Doc2Vec` does not have any official support for adding documents (or the new words or tags that might be in them) to an existing `Doc2Vec` model.
You should either:

- Use inference to obtain full-document vectors for the new documents, by providing those docs (tokenized the same as the training data) to the `.infer_vector()` method. This uses a training-like process to create a good vector for the new text, holding everything else about the model, like its known vocabulary, constant. (So, any novel words in the new document will be ignored.) The resulting vector should be usefully comparable to other vectors created by the original model training, or to other vectors also inferred from the same model.
- Retrain the model from scratch, using all the old and new documents together. (With only 1000 documents, how long could that take?)
The API docs for `.infer_vector()` are at:

https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.Doc2Vec.infer_vector
There are some examples of the use of `.infer_vector()` in the micro-tutorial using the tiny Lee corpus:

https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html
Note also that published results using the 'Paragraph Vector' algorithm inside Gensim's `Doc2Vec` tend to be on corpora of tens-of-thousands to millions of documents. With only 1000, it may be very hard to get good results from this algorithm, which benefits from very-large, very-varied training data.
