I am using Word2vec through gensim with Google's pretrained vectors trained on Google News. I have noticed that the word vectors I get by indexing directly into the Word2Vec object are not unit vectors:
>>> import numpy
>>> from gensim.models import Word2Vec
>>> w2v = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
>>> king_vector = w2v['king']
>>> numpy.linalg.norm(king_vector)
2.9022589
However, the most_similar method does not use these non-unit vectors; instead, it uses normalised versions from the undocumented .syn0norm property, which contains only unit vectors:
>>> w2v.init_sims()
>>> unit_king_vector = w2v.syn0norm[w2v.vocab['king'].index]
>>> numpy.linalg.norm(unit_king_vector)
0.99999994
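As far as I can tell, most_similar is essentially taking the dot product of one of these unit vectors against the whole syn0norm matrix. A rough sketch of my understanding (not gensim's exact code; the variable names are my own):
>>> # cosine scores of 'king' against every word in the vocabulary, via the unit vectors
>>> sims = numpy.dot(w2v.syn0norm, unit_king_vector)
>>> top = [(w2v.index2word[i], float(sims[i])) for i in numpy.argsort(-sims)[:4]]
>>> # top should match w2v.most_similar('king', topn=3), plus 'king' itself at ~1.0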
The larger vector is just a scaled-up version of the unit vector:
>>> king_vector - numpy.linalg.norm(king_vector) * unit_king_vector
array([ 0.00000000e+00, -1.86264515e-09, 0.00000000e+00,
0.00000000e+00, -1.86264515e-09, 0.00000000e+00,
-7.45058060e-09, 0.00000000e+00, 3.72529030e-09,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
... (some lines omitted) ...
-1.86264515e-09, -3.72529030e-09, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00], dtype=float32)
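Equivalently, a more compact check of the same thing:
>>> numpy.allclose(king_vector, numpy.linalg.norm(king_vector) * unit_king_vector)
True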
Given that word similarity comparisons in Word2vec are done by cosine similarity, it's not obvious to me what the lengths of the non-normalised vectors mean, although I assume they mean something, since gensim exposes them to me rather than only exposing the unit vectors in .syn0norm.
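For example, I believe computing cosine similarity directly from the raw vectors gives the same number as gensim's similarity method, which (as I understand it) works on the unit vectors, so the lengths cancel out entirely:
>>> raw_king, raw_queen = w2v['king'], w2v['queen']
>>> manual_cos = numpy.dot(raw_king, raw_queen) / (
...     numpy.linalg.norm(raw_king) * numpy.linalg.norm(raw_queen))
>>> numpy.isclose(manual_cos, w2v.similarity('king', 'queen'))
True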
How are the lengths of these non-normalised Word2vec vectors generated, and what do they mean? For what calculations does it make sense to use the normalised vectors, and when should I use the non-normalised ones?