
With Doc2Vec modelling, I have trained a model and saved the following files:

1. model
2. model.docvecs.doctag_syn0.npy
3. model.syn0.npy
4. model.syn1.npy
5. model.syn1neg.npy

However, I now have a new way to label the documents and want to train the model again. Since the word vectors were already learned by the previous version, is there any way to reuse that model (e.g., taking the previous w2v results as the initial vectors for training)? Does anyone know how to do it?

sophros
HappyCoding

1 Answer


I've figured out that we can just load the model and continue training.

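from gensim.models.doc2vec import Doc2Vec
model = Doc2Vec.load("old_model")
# sentences: a list of TaggedDocument; recent gensim releases require total_examples and epochs explicitly
model.train(sentences, total_examples=len(sentences), epochs=model.epochs)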
HappyCoding
  • You can do this, but (1) if the new `sentences` has new words/tags, they'll be skipped as unknown; (2) if the new `sentences` has a different length, progress reports & learning-rate decay may not be updated properly; (3) it may offer the model a slight 'head start' on useful values, and skips the initial vocabulary-scan, but won't cause the `train()` itself to go any faster. – gojomo Jan 19 '17 at 02:26
  • It's best to train with all examples mixed-together. You could conceivably start with a previously-loaded model, if you call `train()` with the right parameters as hints-of-corpus size. Or, to adapt to new vocabulary/tags, you could do `build_vocab()` with the new combined corpus, but then try to give the model a 'head start' by manually copying over vectors from the original model. – gojomo Jan 23 '17 at 03:43
  • @gojomo, thanks. I agree that, if time allows, it's always better to train from scratch. Otherwise, it's advisable to at least take good care of vocabulary coverage. – HappyCoding Jan 24 '17 at 02:17
  • you should also check https://stackoverflow.com/questions/47775557/updating-training-documents-for-gensim-doc2vec-model – Amir Imani Jan 17 '19 at 16:56
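
The 'head start' suggested in the comments (build the vocabulary on the new combined corpus, then copy vectors over from the original model) might be sketched roughly as below. The helper name `copy_shared_vectors` and the toy data are hypothetical and not part of gensim; the function only assumes a key-to-row mapping and a row-indexable vector table for each model, with the same vector size in both.

```python
def copy_shared_vectors(old_index, old_vectors, new_index, new_vectors):
    """Copy rows for keys present in both models; return how many were copied.

    old_index / new_index map a key (word or document tag) to a row number.
    old_vectors / new_vectors are row-indexable (lists or NumPy arrays).
    """
    copied = 0
    for key, new_row in new_index.items():
        old_row = old_index.get(key)
        if old_row is None:
            continue  # key is new: keep its fresh random initialisation
        # list(...) copies the values so the two tables do not share a row
        new_vectors[new_row] = list(old_vectors[old_row])
        copied += 1
    return copied


# Toy stand-ins for the old and new models' lookup tables (hypothetical data).
old_index = {"cat": 0, "dog": 1}
old_vectors = [[0.1, 0.2], [0.3, 0.4]]
new_index = {"dog": 0, "bird": 1}
new_vectors = [[0.0, 0.0], [9.0, 9.0]]

n_copied = copy_shared_vectors(old_index, old_vectors, new_index, new_vectors)
```

With gensim 4.x, the matching arguments would presumably be `old_model.wv.key_to_index` and `old_model.wv.vectors` for words (and the same attributes on `dv` for tags), called after `build_vocab()` on the new model and before `train()`. Older releases name these attributes differently, so treat this as an assumption to check against the installed version.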