
[Figure 2 from the Doc2Vec paper: the PV-DM model, in which a paragraph vector from matrix *D* is averaged or concatenated with word vectors from matrix *W* to predict the next word]

The above picture is from *Distributed Representations of Sentences and Documents*, the paper that introduced Doc2Vec. I am using Gensim's implementations of Word2Vec and Doc2Vec, which are great, but I am looking for clarity on a few issues.

  1. For a given Doc2Vec model `dvm`, what is `dvm.docvecs`? My impression is that it is the averaged or concatenated vector combining all of the word embeddings and the paragraph vector *d*. Is that correct, or is it *d* itself?
  2. Supposing `dvm.docvecs` is not *d*, can one access *d* by itself? How?
  3. As a bonus, how is *d* calculated? The paper only says:

> In our Paragraph Vector framework (see Figure 2), every paragraph is mapped to a unique vector, represented by a column in matrix D and every word is also mapped to a unique vector, represented by a column in matrix W.

Thanks for any leads!

Michael Davidson

1 Answer


The `docvecs` property of the `Doc2Vec` model holds all the trained vectors for the 'document tags' seen during training. (These are also referred to as 'doctags' in the source code.)

In the simplest case, analogous to the Paragraph Vectors paper, each text example (paragraph) just has a serial integer ID as its 'tag', starting at 0. That integer is an index into the `docvecs` object, and the `model.docvecs.doctag_syn0` numpy array is essentially the same thing as the (capital) *D* in your excerpt from the Paragraph Vectors paper.
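To make that concrete, here is a minimal sketch (the toy corpus is invented; it assumes a pre-4.0 Gensim API, where `docvecs` and `doctag_syn0` exist; in Gensim 4.x they became `dv` and `dv.vectors`):

```python
# Minimal sketch, assuming Gensim < 4.0; the corpus is a toy example.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [["human", "machine", "interface"],
         ["graph", "minors", "survey"]]

# Plain serial-integer IDs as tags, as in the Paragraph Vectors paper
docs = [TaggedDocument(words=words, tags=[i]) for i, words in enumerate(texts)]

model = Doc2Vec(docs, size=50, min_count=1, iter=20)

d0 = model.docvecs[0]           # trained vector for tag 0: one column of D
D  = model.docvecs.doctag_syn0  # full (num_tags x size) array, i.e. D itself
assert (d0 == D[0]).all()       # a paper 'column' is a numpy row here
```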

(Gensim also supports using string tokens as document tags, multiple tags per document, and tags repeated across many training documents. Any string tags are mapped to indexes near the end of the `docvecs` array by the dict `model.docvecs.doctags`.)
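A similar sketch for string tags (same assumed pre-4.0 API; the tag names here are hypothetical):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [["human", "machine", "interface"],
         ["graph", "minors", "survey"]]

# Hypothetical string tags instead of plain integers
docs = [TaggedDocument(words=words, tags=["doc_%d" % i])
        for i, words in enumerate(texts)]
model = Doc2Vec(docs, size=50, min_count=1, iter=20)

vec = model.docvecs["doc_0"]                 # lookup by string tag
row = model.docvecs.doctags["doc_0"].offset  # its row in doctag_syn0
assert (vec == model.docvecs.doctag_syn0[row]).all()
```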

gojomo
  • Thanks for the reply. If I understand your first sentence, `docvecs` is the unique document vector corresponding to the vector next to 'Average/Concatenate' in the figure above. Is that correct? – Michael Davidson Jan 19 '17 at 16:33
  • Actually `model.docvecs` is a helper object holding *all* the document-vectors being trained. It (specifically its `doctag_syn0` array, which is like the 'Paragraph Matrix' in the diagram) is consulted to get an individual paragraph-vector (a column of *D*, shown in orange in the diagram) to mix with word-vectors for a single training example. – gojomo Jan 19 '17 at 19:25
  • Interesting. And when `dm=0`, and thus the PV-DBOW algorithm is being used, that `model.docvecs` is equal to `model.docvecs.doctag_syn0`. This makes sense, I suppose, because there are no word embeddings being joined with the paragraph matrix. Thanks for the help! – Michael Davidson Jan 19 '17 at 19:55
  • The relationship between `model.docvecs` and its underlying raw numpy array `model.docvecs.doctag_syn0` is the same no matter which mode is used. In all cases the individual doctag-keyed vectors live in `model.docvecs.doctag_syn0`. In pure DBOW those vectors are the only input used to predict each document's words; in DM they are combined with word-vectors to predict nearby in-window words. – gojomo Jan 20 '17 at 00:42
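A final sketch of that last point, under the same assumed pre-4.0 Gensim API: the `dm` parameter switches between PV-DM and PV-DBOW, but the doc-vector storage is identical in both modes.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["graph", "minors", "survey"], tags=[0]),
        TaggedDocument(words=["human", "machine", "interface"], tags=[1])]

# Pure PV-DBOW (dm=0): doc-vectors alone are trained to predict each doc's words
dbow = Doc2Vec(docs, dm=0, size=50, min_count=1, iter=20)

# PV-DM (dm=1, the default): doc-vectors are mixed with word-vectors
dm = Doc2Vec(docs, dm=1, size=50, min_count=1, iter=20)

# Same storage either way: one row per doctag in doctag_syn0
print(dbow.docvecs.doctag_syn0.shape)  # (2, 50)
print(dm.docvecs.doctag_syn0.shape)    # (2, 50)
```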