There's no inherent guarantee in Word2Vec/Doc2Vec that the generated set of vectors is symmetrically distributed around the origin. They could be disproportionately clustered in certain directions, which would yield the kind of results you've seen.
In a few quick tests on the toy-sized dataset (the 'Lee corpus') used in the bundled gensim `docs/notebooks/doc2vec-lee.ipynb` notebook, checking the cosine-similarities of all documents against the first document (see the sketch after this list), it loosely appears that:
- using hierarchical-softmax rather than negative sampling (`hs=1, negative=0`) yields a balance between >0.0 and <0.0 cosine-similarities that is closer to (but not quite) half-and-half
- using a smaller number of negative samples (such as `negative=1`) yields a more balanced set of results; using a larger number (such as `negative=10`) yields relatively more >0.0 cosine-similarities
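For illustration, here's a minimal sketch of that kind of check (not the exact code behind the observations above), assuming gensim 4.x (doc-vectors under `model.dv`) and a local, one-document-per-line copy of the Lee corpus such as the `lee_background.cor` file shipped with gensim's test data; adjust the path for your install, and expect counts to vary run-to-run:

```python
# Sketch: count >0.0 vs <0.0 cosine-similarities against the first document,
# for a few different hs/negative settings. Assumes gensim 4.x and a local
# copy of the Lee corpus (path below is an assumption, not a fixed location).
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

corpus_path = 'lee_background.cor'  # hypothetical local path to the Lee corpus
docs = [TaggedDocument(simple_preprocess(line), [i])
        for i, line in enumerate(open(corpus_path, encoding='utf-8'))]

def pos_neg_balance(**kwargs):
    """Train a small Doc2Vec model, then count doc-vectors whose
    cosine-similarity to doc 0 is >0.0 vs <0.0."""
    model = Doc2Vec(docs, vector_size=50, epochs=40, min_count=2, **kwargs)
    vecs = model.dv.vectors
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = unit @ unit[0]  # cosine-similarities of every doc against doc 0
    return (sims > 0.0).sum(), (sims < 0.0).sum()

for params in ({'hs': 1, 'negative': 0},   # hierarchical-softmax
               {'negative': 1},            # few negative samples
               {'negative': 10}):          # many negative samples
    print(params, pos_neg_balance(**params))
```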
While not conclusive, this is mildly suggestive that the arrangement of vectors may be influenced by the `negative` parameter. Specifically, typical negative-sampling settings, such as the default `negative=5`, mean words are trained more often as non-targets than as positive targets. That might push the preponderance of final coordinates in one direction. (More testing on larger datasets and in other modes, and more analysis of how the model setup could affect final vector positions, would be necessary to have more confidence in this idea.)
If for some reason you wanted a more balanced arrangement of vectors, you could consider transforming their positions, post-training.
There's an interesting recent paper in the word2vec space, "All-but-the-Top: Simple and Effective Postprocessing for Word Representations", which found that sets of trained word-vectors don't necessarily have a zero-magnitude mean: on average, they're offset in some particular direction from the origin. Further, the paper reports that subtracting the common mean (to 're-center' the set), and also removing a few other dominant directions, can improve the vectors' usefulness for certain tasks.
Intuitively, I suspect this 'all-but-the-top' transformation might serve to increase the discriminative 'contrast' in the resulting vectors.
A similar process might yield similar benefits for doc-vectors, and would likely make the full set of cosine-similarities to any doc-vector more balanced between >0.0 and <0.0 values; a rough sketch of such a step is below.
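If you wanted to experiment with that kind of post-processing on doc-vectors, a rough numpy-only sketch of the re-center-and-remove-top-directions idea (my own extrapolation to doc-vectors, not code from the paper or from gensim) might look like:

```python
# Sketch of an 'All-but-the-Top'-style post-processing step, applied here to
# Doc2Vec doc-vectors rather than word-vectors (the paper targets word-vectors;
# using it on doc-vectors is an extrapolation, not the paper's own recipe).
import numpy as np

def all_but_the_top(vectors, n_components=2):
    """Re-center the vectors, then remove their top principal directions.
    The paper suggests roughly dim/100 components; n_components=2 is just an
    illustrative default for small vector sizes."""
    centered = vectors - vectors.mean(axis=0)           # subtract the common mean
    # top principal directions of the centered set, via SVD
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]                             # shape: (n_components, dim)
    # project out the dominant directions
    return centered - centered @ top.T @ top

# Example usage (hypothetical `model` variable from a trained Doc2Vec):
# post_vectors = all_but_the_top(model.dv.vectors.copy(), n_components=2)
```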