
I understand that in doc2vec (the DM approach, shown on the left of the figure) the paragraph ID is treated as an extra word during training, and the network is trained to predict a target word from its context plus that paragraph vector. After a model is trained, suppose I want to get a single embedding for a new document.

Do I feed each word to the network and then average the resulting vectors to get the embedding? Or is there another way?

I can feed this to gensim, but I am trying to understand how it works.

[figure: doc2vec architecture diagram, with the DM approach on the left]

dorien

2 Answers


During model bulk training, the candidate doc-vector is gradually nudged to be better at predicting the text's words, just like word-vector training. So at the end of training, you have doc-vectors for all the identifiers you provided alongside the texts.

You can access these from a gensim Doc2Vec model via dict-style indexed lookup of the identifier (called a 'doctag' in gensim) that you provided during training: `model.docvecs[tag]`
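A minimal sketch of training and doctag lookup, assuming a tiny hand-made corpus; the tags `doc_0`/`doc_1` and all parameter values are illustrative, not tuned:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each text gets a doctag supplied at training time.
corpus = [
    TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=["doc_0"]),
    TaggedDocument(words=["dogs", "chase", "cats", "in", "the", "yard"], tags=["doc_1"]),
]

# dm=1 selects the PV-DM architecture discussed in the question.
model = Doc2Vec(documents=corpus, dm=1, vector_size=50, window=2,
                min_count=1, epochs=40)

# Dict-style lookup of a trained doc-vector by its doctag.
# (In gensim 4.x the preferred attribute is model.dv; model.docvecs
# remains available as an alias in many versions.)
vec_doc0 = model.docvecs["doc_0"]
print(vec_doc0.shape)  # (50,)
```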

Post-training, to get the doc-vector for a new text, an inference process is used. The model is held frozen, and a new random candidate vector (just like those that started bulk training for training texts) is formed for the text. Then it's incrementally nudged, in a manner fully analogous to training, to be better at predicting the words – but only this one new candidate vector is changed. (All model internal weights stay the same.)

You can calculate such new vectors via the `infer_vector()` method, which takes a list of word tokens that should have been preprocessed just like the texts provided during training: `model.infer_vector(words)`.
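A minimal sketch of inference, continuing from the toy model trained above; the token list and the `epochs=100` value are illustrative, not recommendations:

```python
# Tokens for an unseen text, preprocessed the same way as the training texts.
new_words = ["a", "cat", "in", "the", "yard"]

# infer_vector() holds the trained model frozen and nudges only a new
# candidate doc-vector toward predicting these words. More epochs generally
# gives a more stable result for short texts.
new_vec = model.infer_vector(new_words, epochs=100)
print(new_vec.shape)  # (50,)

# Because inference starts from a random candidate vector, repeated calls
# return slightly different (but similar) vectors for the same text.
```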

gojomo

I would expect the method above, which freezes the model and trains only a new random paragraph vector, to be more effective, but I have seen it stated that simply averaging all the word vectors in a sentence works better in some cases.

YaFeng Luo
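As a rough illustration of that alternative, here is a sketch of averaging the model's word vectors for a text, reusing the toy model trained in the first sketch; `average_word_vectors` is a made-up helper name, not a gensim API:

```python
import numpy as np

def average_word_vectors(model, words):
    """Average the trained word vectors of the in-vocabulary tokens."""
    vecs = [model.wv[w] for w in words if w in model.wv]
    if not vecs:
        # No known tokens: fall back to a zero vector of the right size.
        return np.zeros(model.vector_size)
    return np.mean(vecs, axis=0)

avg_vec = average_word_vectors(model, ["a", "cat", "in", "the", "yard"])
print(avg_vec.shape)  # (50,)
```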