There's no inherent guarantee in Word2Vec/Doc2Vec that the generated set of vectors is symmetrically distributed around the origin. They could be disproportionately clustered in certain directions, which would yield the kind of results you've seen.
In a few quick tests on the toy-sized dataset (the 'Lee corpus') used in the bundled gensim `docs/notebooks/doc2vec-lee.ipynb` notebook, checking the cosine-similarities of all documents against the first document (see the sketch after this list), it loosely appears that:
- using hierarchical-softmax rather than negative sampling (`hs=1, negative=0`) yields a balance between >0.0 and <0.0 cosine-similarities that is closer to (but not quite) half-and-half
- using a smaller number of negative samples (such as `negative=1`) yields a more balanced set of results; using a larger number (such as `negative=10`) yields relatively more >0.0 cosine-similarities
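For illustration, here's a minimal sketch of that kind of check (not the exact code behind the observations above), assuming gensim 4.x (doc-vectors under `model.dv`) and a local, one-document-per-line copy of the Lee corpus such as the `lee_background.cor` file shipped with gensim's test data; adjust the path for your install, and expect counts to vary run-to-run:

```python
# Sketch: count >0.0 vs <0.0 cosine-similarities against the first document,
# for a few different hs/negative settings. Assumes gensim 4.x and a local
# copy of the Lee corpus (path below is an assumption, not a fixed location).
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

corpus_path = 'lee_background.cor'  # hypothetical local path to the Lee corpus
docs = [TaggedDocument(simple_preprocess(line), [i])
        for i, line in enumerate(open(corpus_path, encoding='utf-8'))]

def pos_neg_balance(**kwargs):
    """Train a small Doc2Vec model, then count doc-vectors whose
    cosine-similarity to doc 0 is >0.0 vs <0.0."""
    model = Doc2Vec(docs, vector_size=50, epochs=40, min_count=2, **kwargs)
    vecs = model.dv.vectors
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = unit @ unit[0]  # cosine-similarities of every doc against doc 0
    return (sims > 0.0).sum(), (sims < 0.0).sum()

for params in ({'hs': 1, 'negative': 0},   # hierarchical-softmax
               {'negative': 1},            # few negative samples
               {'negative': 10}):          # many negative samples
    print(params, pos_neg_balance(**params))
```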
While not conclusive, this is mildly suggestive that the arrangement of vectors may be influenced by the `negative` parameter. Specifically, typical negative-sampling settings, such as the default `negative=5`, mean words are trained more often as non-targets than as positive targets. That might push the preponderance of final coordinates in one direction. (More testing on larger datasets and in other modes, and more analysis of how the model setup could affect final vector positions, would be necessary to have more confidence in this idea.)
If for some reason you wanted a more balanced arrangement of vectors, you could consider transforming their positions, post-training.
There's an interesting recent paper in the word2vec space, "All-but-the-Top: Simple and Effective Postprocessing for Word Representations", which found that sets of trained word-vectors don't necessarily have a zero-magnitude mean: on average, they're offset in some particular direction from the origin. Further, the paper reports that subtracting the common mean (to 're-center' the set), and also removing a few other dominant directions, can improve the vectors' usefulness for certain tasks.
Intuitively, I suspect this 'all-but-the-top' transformation might serve to increase the discriminative 'contrast' in the resulting vectors.
A similar process might yield similar benefits for doc-vectors, and would likely make the full set of cosine-similarities to any doc-vector more balanced between >0.0 and <0.0 values; a rough sketch of such a step is below.
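If you wanted to experiment with that kind of post-processing on doc-vectors, a rough numpy-only sketch of the re-center-and-remove-top-directions idea (my own extrapolation to doc-vectors, not code from the paper or from gensim) might look like:

```python
# Sketch of an 'All-but-the-Top'-style post-processing step, applied here to
# Doc2Vec doc-vectors rather than word-vectors (the paper targets word-vectors;
# using it on doc-vectors is an extrapolation, not the paper's own recipe).
import numpy as np

def all_but_the_top(vectors, n_components=2):
    """Re-center the vectors, then remove their top principal directions.
    The paper suggests roughly dim/100 components; n_components=2 is just an
    illustrative default for small vector sizes."""
    centered = vectors - vectors.mean(axis=0)           # subtract the common mean
    # top principal directions of the centered set, via SVD
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]                             # shape: (n_components, dim)
    # project out the dominant directions
    return centered - centered @ top.T @ top

# Example usage (hypothetical `model` variable from a trained Doc2Vec):
# post_vectors = all_but_the_top(model.dv.vectors.copy(), n_components=2)
```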