(1) Yes, word-vectors are trained simultaneously with doc-vectors in PV-DM mode (`dm=1`).
(2) The contents of the `wv` property before training happens are the randomly-initialized, untrained word-vectors. (As in word2vec, all vectors get random, low-magnitude starting positions.)
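For illustration, here's a minimal sketch of (1) & (2) – assuming gensim 4.x, with a made-up toy corpus and arbitrary parameter values:

```python
# Sketch only: gensim 4.x assumed; corpus & parameters are illustrative.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["the", "cat", "sat", "down"], tags=[0]),
        TaggedDocument(words=["the", "dog", "ran", "away"], tags=[1])]

model = Doc2Vec(dm=1, vector_size=50, window=2, min_count=1, epochs=40)
model.build_vocab(docs)

before = model.wv["cat"].copy()   # random, low-magnitude initialization
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)
after = model.wv["cat"]           # updated alongside the doc-vectors in PV-DM

print((before != after).any())    # True: the word-vectors were trained
```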
(3) In plain PV-DBOW mode (`dm=0`), because of code-sharing, the `wv` vectors are still allocated & initialized – but never trained. At the end of PV-DBOW training, the `wv` word-vectors will be unchanged, and thus random/useless. (They don't participate in training at all.)
If you enable the optional `dbow_words=1` parameter, then skip-gram word-vector training will be mixed in with plain PV-DBOW training. This will be done in an interleaved fashion, so each target word (to be predicted) will be used to train a PV-DBOW doc-vector, then the neighboring context word-vectors. As a result, the `wv` word-vectors will be trained, and in the "same space" for meaningful comparisons to doc-vectors.
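A sketch of that `dbow_words=1` variant, same assumptions as above:

```python
# PV-DBOW plus interleaved skip-gram word training (dbow_words=1).
model = Doc2Vec(dm=0, dbow_words=1, vector_size=50, window=2, min_count=1, epochs=40)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)

# Word-vectors now live in the same space as doc-vectors, so comparisons
# like "which docs are closest to this word?" become meaningful.
print(model.dv.most_similar(positive=[model.wv["cat"]], topn=2))
```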
With this option, training will take longer than in plain PV-DBOW, by a factor related to the `window` size. For any particular end-purpose, the doc-vectors in this mode might be better (if the word-to-word predictions effectively extend the corpus in useful ways) or worse (if the effort spent on word-to-word predictions dilutes/overwhelms other patterns in the full-document doc-to-word predictions).