
I need to add and subtract word vectors for a project in which I use gensim.models.KeyedVectors (from the word2vec-google-news-300 model).

Unfortunately, I've tried but can't manage to do it correctly.

Let's look at the popular example queen ~= king - man + woman.
When I want to subtract man from king and add woman,
I can do this with gensim by

# model is loaded using gensim.models.KeyedVectors.load()
model.wv.most_similar(positive=["king", "woman"], negative=["man"])[0]

which, as expected, returns ('queen', 0.7118192911148071) for the model I use.

Now, to achieve the same with adding and subtracting vectors (all of them are unit-normed), I've tried the following code:

 vec_king, vec_man, vec_woman = model.wv["king"], model.wv["man"], model.wv["woman"]
 result = model.similar_by_vector(vec_king - vec_man + vec_woman)[0]

result in the code above is ('king', 0.7992597222328186), which is not what I'd expect.

What is my mistake?

sthorm

1 Answer


You're generally doing the right thing, but note:

  • the most_similar() method excludes from its results any of the words you pass in, so even if 'king' is (still) the closest word to the result vector, it is skipped. Your formulation may well have 'queen' as the next-closest word once the input words are ignored, which is all that the 'analogy' tests need.

  • the most_similar() method also does its vector arithmetic on versions of the vectors that are normalized to unit length, which can give slightly different answers. If you replace your uses of model.wv['king'] with model.get_vector('king', norm=True), you'll get the unit-normed vectors instead.
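
To make the first point concrete, here is a minimal, self-contained sketch. It does not use gensim or the real model: the 3-dimensional vectors and the names TOY_VECTORS, unit, cosine and analogy are all invented for illustration, and the function only mimics my description of most_similar() (unit-norm the inputs, combine them, rank by cosine similarity, optionally skip the input words).

```python
import math

# Toy 3-dimensional "embeddings", invented purely for illustration.
# They are NOT taken from word2vec-google-news-300.
TOY_VECTORS = {
    "king":  [0.9, 0.1, 0.4],
    "queen": [0.6, -0.1, 0.8],
    "man":   [0.0, 0.1, 1.0],
    "woman": [0.0, -0.1, 1.0],
    "apple": [0.1, 0.0, -0.2],
}


def unit(v):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def analogy(positive, negative, vectors, exclude_inputs=True):
    """Rank words by cosine similarity to sum(positive) - sum(negative).

    Hypothetical helper: the inputs are unit-normed before being combined,
    and the input words are dropped from the ranking when exclude_inputs
    is True.
    """
    inputs = set(positive) | set(negative)
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for sign, words in ((1.0, positive), (-1.0, negative)):
        for word in words:
            for i, x in enumerate(unit(vectors[word])):
                target[i] += sign * x
    return sorted(
        (
            (word, cosine(target, vec))
            for word, vec in vectors.items()
            if not (exclude_inputs and word in inputs)
        ),
        key=lambda pair: pair[1],
        reverse=True,
    )


naive = analogy(["king", "woman"], ["man"], TOY_VECTORS, exclude_inputs=False)
filtered = analogy(["king", "woman"], ["man"], TOY_VECTORS)

print(naive[0][0])     # the input word 'king' is still the nearest neighbour
print(filtered[0][0])  # 'queen' comes first once the input words are skipped
```

The similarity score of 'queen' is the same in both rankings; only the filtering changes which word comes first. With the real model, the same effect can presumably be had by asking similar_by_vector() for a few more results (its topn parameter) and discarding any that are among the input words.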

See also this similar earlier answer: https://stackoverflow.com/a/65065084/130288

gojomo
    My vectors were already unit-normed, which I forgot to mention (I've updated the question with this missing piece of info). Ignoring the words belonging to the input vectors does the trick: queen then ranks highest, with the exact same value as model.wv.most_similar(). Perfect. – sthorm Jan 08 '21 at 08:42