Gensim Doc2Vec's infer_vector() on paragraphs consisting of unseen words generates vectors that differ depending on the characters in the unseen words.

# model is assumed to be a trained gensim.models.doc2vec.Doc2Vec instance
for i in range(2):
    print(model.infer_vector(["zz"])[0:2])
    print(model.infer_vector(["zzz"])[0:2])
    print(model.infer_vector(["zzzz"])[0:2])
    print("\n")

[ 0.00152548 -0.00055992]
[-0.00165872 -0.00047997]
[0.00125548 0.00053445]


[ 0.00152548 -0.00055992] # same as in previous iteration
[-0.00165872 -0.00047997]
[0.00125548 0.00053445]

I am trying to understand how unseen words affect the initialization of infer_vector. It looks like different characters produce different vectors, and I'd like to understand why.

Stanley Kirdey

1 Answer

Unseen words are ignored for the actual process of iterative inference: tuning a vector to better-predict a text's words, according to a frozen Doc2Vec model.

However, inference starts with a pseudorandomly-initialized vector. And, the full set of tokens passed-in (including unknown words) are used as the seed for that random-initialization.

This seeded initialization is done as a potential small aid to those seeking fully-reproducible inference – but in practice, seeking such exact-reproduction, rather than just run-to-run similarity, is usually a bad idea. See the gensim FAQs Q11 & Q12 about varying results from run-to-run for more details.
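For intuition, here is a minimal sketch of that kind of string-seeded initialization. It illustrates the idea, not gensim's exact internals (which vary by version), and the helper name `seeded_init` is invented for this example:

import numpy as np

def seeded_init(tokens, vector_size):
    # Join the full token list (unknown words included) into one seed
    # string, so identical token lists always get the same start vector.
    seed_string = ' '.join(tokens)
    # Caveat: Python 3 randomizes the built-in hash() of strings per
    # process unless PYTHONHASHSEED is set, which is one reason exact
    # run-to-run reproducibility is hard to guarantee.
    rng = np.random.RandomState(hash(seed_string) & 0xffffffff)
    # Small random values centered on the origin
    return (rng.rand(vector_size) - 0.5) / vector_size

Under a scheme like this, `["zz"]`, `["zzz"]`, and `["zzzz"]` produce different seed strings and therefore different (but repeatable) starting vectors, which is exactly the pattern in the question's output.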

So what you're seeing is:

  • each of your different tokenized texts causes a vector initialization that is pseudorandom, but deterministic with respect to the exact tokens passed in
  • since no words are known, inference afterwards is a no-op: there are no words to predict
  • the pseudorandom initialized vector is returned

The infer_vector() method should probably log a warning, or return a flag value (like perhaps the origin vector), as a better hint that nothing meaningful is actually happening.

But you may wish to check any text before you supply it to infer_vector() – if none of its words are in the d2v_model.wv, then inference will simply return a small random initialization vector.
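For example, a small guard like this sketch implements that check (`infer_known` and its `None` convention are illustrative, not part of gensim's API):

def infer_known(d2v_model, tokens, **kwargs):
    # Keep only tokens the model saw during training
    known = [t for t in tokens if t in d2v_model.wv]
    if not known:
        # Every token is unknown: inference would just hand back the
        # seeded random initialization, so signal that explicitly.
        return None
    return d2v_model.infer_vector(known, **kwargs)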

gojomo
  • Quick question: if I combine a paragraph like ['known word', 'unknown word'], should the unknown word still affect the pseudorandom initialization? It does now, and it seems to change vectors for small sentences quite a lot. – Stanley Kirdey Dec 25 '19 at 23:42
  • 1
    Yes, it's the string representation of the whole list-of-tokens that affects the seeding, so the seeding for `[known_word_a,]`, `[known_word_a, unknown_word_b]`, & `[known_word_a, unknown_word_c]` will all be different. If you *needed* the same seeding, you could discard unknown words before passing to `infer_vector()` – eg: `words = [word for word in words if word in d2v_model.wv]`. – gojomo Dec 26 '19 at 00:57
  • 1
    If the vectors *after* inference are very-different from each other, then inference isn't working very well – either the model has representational problems, or insufficient effort is going into inference. (Consider specifying an `epochs` much larger than the default, which will only be `5` if you didn't specify more `epochs` when creating the model.) And note that inference results for tiny sentences of just 1 or a few words may always be a bit weird/extreme: the best doc-vectors likely come from the balanced tug-of-war between trying to predict all a text's words. – gojomo Dec 26 '19 at 01:01
  • Great advice on discarding unknown words before seeding, thank you! – Stanley Kirdey Dec 26 '19 at 08:12
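A short usage sketch of the `epochs` advice above (the value 50 is illustrative, not a recommendation):

# More inference passes usually give more stable, repeatable vectors;
# tune epochs for your own model and documents.
vec = d2v_model.infer_vector(["some", "known", "tokens"], epochs=50)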