
I understand that in doc2vec (the DM approach, shown on the left of the figure) the paragraph ID is treated as an extra word during training, and the network is trained to predict a target word from its context plus that paragraph vector. After a model is trained, suppose I want to get a single embedding for a new document.

Do I feed each word to the network and then average the resulting vectors to get the embedding? Or is there another way?

I can feed this to gensim, but I am trying to understand how it works.

[figure: doc2vec architecture diagram, with the DM approach on the left]

dorien

2 Answers


During model bulk training, the candidate doc-vector is gradually nudged to be better at predicting the text's words, just like word-vector training. So at the end of training, you have doc-vectors for all the identifiers you provided alongside the texts.

You can access these from a gensim Doc2Vec model via dict-style indexed lookup of the identifier (called a 'doctag' in gensim) that you provided during training: `model.docvecs[tag]`
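A minimal sketch of training and doctag lookup, assuming a tiny hand-made corpus; the tags `doc_0`/`doc_1` and all parameter values are illustrative, not tuned:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each text gets a doctag supplied at training time.
corpus = [
    TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=["doc_0"]),
    TaggedDocument(words=["dogs", "chase", "cats", "in", "the", "yard"], tags=["doc_1"]),
]

# dm=1 selects the PV-DM architecture discussed in the question.
model = Doc2Vec(documents=corpus, dm=1, vector_size=50, window=2,
                min_count=1, epochs=40)

# Dict-style lookup of a trained doc-vector by its doctag.
# (In gensim 4.x the preferred attribute is model.dv; model.docvecs
# remains available as an alias in many versions.)
vec_doc0 = model.docvecs["doc_0"]
print(vec_doc0.shape)  # (50,)
```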

Post-training, to get the doc-vector for a new text, an inference process is used. The model is held frozen, and a new random candidate vector (just like those that started bulk training for training texts) is formed for the text. Then it's incrementally nudged, in a manner fully analogous to training, to be better at predicting the words – but only this one new candidate vector is changed. (All model internal weights stay the same.)

You can calculate such new vectors via the `infer_vector()` method, which takes a list of word tokens that should have been preprocessed just like the texts provided during training: `model.infer_vector(words)`.
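A minimal sketch of inference, continuing from the toy model trained above; the token list and the `epochs=100` value are illustrative, not recommendations:

```python
# Tokens for an unseen text, preprocessed the same way as the training texts.
new_words = ["a", "cat", "in", "the", "yard"]

# infer_vector() holds the trained model frozen and nudges only a new
# candidate doc-vector toward predicting these words. More epochs generally
# gives a more stable result for short texts.
new_vec = model.infer_vector(new_words, epochs=100)
print(new_vec.shape)  # (50,)

# Because inference starts from a random candidate vector, repeated calls
# return slightly different (but similar) vectors for the same text.
```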

gojomo

I would expect the method above, which freezes the model and trains only a new random paragraph vector, to be more effective, but I have seen it stated that simply averaging all the word vectors in a sentence works better in some cases.

YaFeng Luo
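As a rough illustration of that alternative, here is a sketch of averaging the model's word vectors for a text, reusing the toy model trained in the first sketch; `average_word_vectors` is a made-up helper name, not a gensim API:

```python
import numpy as np

def average_word_vectors(model, words):
    """Average the trained word vectors of the in-vocabulary tokens."""
    vecs = [model.wv[w] for w in words if w in model.wv]
    if not vecs:
        # No known tokens: fall back to a zero vector of the right size.
        return np.zeros(model.vector_size)
    return np.mean(vecs, axis=0)

avg_vec = average_word_vectors(model, ["a", "cat", "in", "the", "yard"])
print(avg_vec.shape)  # (50,)
```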