gensim most_similar with positive and negative, how does it work?

Question

I was reading this answer, which says the following about Gensim's most_similar:

it performs vector arithmetic: adding the positive vectors, subtracting the negative, then from that resulting position, listing the known-vectors closest to that angle.

But when I tested it, that did not seem to be the case. I trained a Word2Vec model with Gensim on the "text8" dataset and tested these two calls:

model.most_similar(positive=['woman', 'king'], negative=['man'])

>>> [('queen', 0.7131118178367615), ('prince', 0.6359186768531799),...]

model.wv.most_similar([model["king"] + model["woman"] - model["man"]])

>>> [('king', 0.84305739402771), ('queen', 0.7326322793960571),...]

They are clearly not the same. Even the score for queen differs: 0.713 in the first result and 0.732 in the second.

So I ask the question again: how does Gensim's most_similar work? Why are the results of the two calls above different?

It is expected. See https://blog.esciencecenter.nl/king-man-woman-king-9a7fd2935a85 — mon, Jan 22 '22 at 01:46
FYI, I've implemented and replicated the `most_similar` results [here](https://github.com/viniciusarruda/word2vec). — ViniciusArruda, Aug 23 '23 at 14:46

gojomo · Accepted Answer · 2020-11-29T21:06:34.843

The adding and subtracting isn't all that it does; for an exact description, you should look at the source code:

https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py#LC690:~:text=def%20most_similar,self%2C

You'll see there that the addition and subtraction are performed on the unit-normed version of each vector, obtained via the get_vector(key, use_norm=True) accessor.

If you change your use of model[key] to model.get_vector(key, use_norm=True), you should see your outside-the-method calculation of the target vector give the same results as letting the method combine the positive and negative vectors.
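To make the effect of unit-norming concrete, here is a minimal NumPy sketch. It is not Gensim's actual implementation: the five 3-d vectors are made-up toy values, and unit, cosine, and rank are hypothetical helpers written only for this illustration. It shows that summing raw vectors and summing unit-normed vectors give different target directions, and therefore different cosine scores for the same word. It also mimics the fact that most_similar leaves the input keys out of its results when they are passed by name, which is why king shows up only in your second call.

```python
import numpy as np

# Toy "embeddings" with deliberately different lengths (made-up values,
# not taken from any trained model).
vectors = {
    "king": np.array([4.0, 1.0, 0.5]),
    "queen": np.array([1.5, 2.0, 0.4]),
    "man": np.array([1.0, 0.1, 0.0]),
    "woman": np.array([0.3, 0.9, 0.1]),
    "prince": np.array([2.0, 0.8, 1.0]),
}


def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(unit(a), unit(b)))


def rank(target, exclude=()):
    """Rank all known words by cosine similarity to target, skipping exclude."""
    scores = [(w, cosine(target, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# What the question computed by hand: arithmetic on the raw vectors.
raw_target = vectors["king"] + vectors["woman"] - vectors["man"]

# What the answer describes: arithmetic on the unit-normed vectors.
normed_target = unit(vectors["king"]) + unit(vectors["woman"]) - unit(vectors["man"])

raw_scores = dict(rank(raw_target))
normed_scores = dict(rank(normed_target))
print("queen, raw sum:   ", raw_scores["queen"])
print("queen, normed sum:", normed_scores["queen"])

# Ranking with the input words left out, as happens when keys are passed by name.
print(rank(normed_target, exclude={"king", "woman", "man"}))
```

With raw vectors, a long vector such as king dominates the sum; after unit-norming, each input word contributes equally to the direction of the target. One caveat, stated as an assumption about versions rather than something from the answer: in the Gensim 4.x releases I am aware of, the keyword is spelled norm (get_vector(key, norm=True)), and in 3.x the equivalent call is word_vec(key, use_norm=True), so check the signature in your installed version.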
