
I am using the Doc2Vec model from the gensim (4.1.2) Python library.

I trained the model on my corpus of documents and used infer_vector(). Then I saved the model and tried infer_vector() on the same text, but I get a totally different vector. What is wrong?

Here is an example of the code:

doc2vec_model.infer_vector(["system", "response"])
array([-1.02667394e-03, -2.73817539e-04, -2.08510624e-04,  1.01583987e-03,
       -4.99124289e-04,  4.82861622e-04, -9.00296785e-04,  9.18195175e-04,
....
doc2vec_model.save('model/doc2vec')

If I load the saved model:

fname = "model/model_doc2vec"
model = Doc2Vec.load(fname)
model.infer_vector(["system", "response"])
array([-1.07945153e-03,  2.80674692e-04,  4.65555902e-04,  6.55420765e-04,
        7.65898672e-04, -9.16261168e-04,  9.15124183e-05, -5.18970715e-04,
....
sergzemsk

1 Answer


First, there's a natural amount of variance from one run of infer_vector() to another; that's inherent to how the algorithm works. The vector will be at least a little different every time you run it, even without the save/load in between. For more details, see:

Q12: I've used Doc2Vec infer_vector() on a single text, but the resulting vector is different each time. Is there a bug or have I made a mistake? (doc2vec inference non-determinism)
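
For instance, a quick way to see this variance (and how small it should be) is to re-infer the same text twice and compare the results by cosine similarity, rather than eyeballing the raw numbers. A minimal sketch, assuming doc2vec_model is your already-trained model:

import numpy as np

tokens = ["system", "response"]
v1 = doc2vec_model.infer_vector(tokens)
v2 = doc2vec_model.infer_vector(tokens)

# The raw arrays will differ run-to-run, but for a well-trained model and a
# reasonable-length text the cosine similarity should be high (close to 1.0).
cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cos_sim)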

Second, a 2-word text is a minimal corner case on which Doc2Vec is less likely to work very well. It works better on texts that are at least dozens of words long. In particular, both training & inference are processes that work in proportion to the number of words in a text. So a 100-word text, that goes through inference to find a new vector, will get 50x more 'adjustment nudges' than a mere 2-word text – and thus tend to be somewhat more stable, run-to-run, than a tiny text. (As mentioned in the FAQ item linked above, increasing the epochs may help a bit, making a small text a little more like a longer text – but I would still expect any small text to be more at the mercy of the vagaries of random initialization, and random sampling during incremental adjustment, than a longer text.)
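
For example (a sketch, reusing the 2-word text from the question): infer_vector() accepts an epochs parameter, so you can give a short text more inference passes than the model's default:

v_default = model.infer_vector(["system", "response"])            # model's default epochs
v_steadier = model.infer_vector(["system", "response"], epochs=50)  # more adjustment passes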

Finally, other problems in the model – like insufficient training data, overfitting (especially when the model is too large for the amount of training data), or other suboptimal parameters or errors during training – can make a model that's especially inconsistent from inference to inference.

The vectors from repeated inferences will never be identical, but they should be fairly close, when parameters are good & training is sufficient. (In fact, one indirect way to test if a model is doing anything useful is to check, at the end of training, how often a re-inferred vector for a training text is the top, or one of the top few, neighbors of the same text's vector from bulk training.)
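
A sketch of that sanity check, assuming your training texts are TaggedDocument objects in a list named train_corpus (an illustrative name), each with a single tag:

import collections

ranks = []
for doc in train_corpus:
    inferred = model.infer_vector(doc.words)
    top_tags = [tag for tag, _ in model.dv.most_similar([inferred], topn=10)]
    # Rank 0 means the doc's own trained vector is the nearest neighbor of
    # its re-inferred vector; -1 means it didn't even make the top 10.
    ranks.append(top_tags.index(doc.tags[0]) if doc.tags[0] in top_tags else -1)

print(collections.Counter(ranks))  # a healthy model concentrates mass at rank 0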

One possible error is too few epochs – the default of 5 inherited from Word2Vec is often too few, with 10 or 20 often being better. (Or, if you're struggling with minimal amounts of data, even more epochs can help eke out some results – though really, this algorithm needs lots of training data. Published results typically use at least tens-of-thousands, if not millions, of separate training docs, each at least dozens, but ideally hundreds or in some cases thousands, of words long.) With less data (and possibly too many vector_size dimensions for tiny training data), models will be 'looser' or more arbitrary when modeling new data.
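
A minimal training sketch with a higher epochs value (the parameter values here are illustrative, not recommendations for your data):

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=100, min_count=5, epochs=20)
model.build_vocab(train_corpus)  # train_corpus: an iterable of TaggedDocument
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)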

Another very common error is to follow some of the bad tutorials online that call .train() many times in your own training loop, (mis-)managing the training alpha manually. This is almost never a good idea. See this other answer for more details on this common error:

My Doc2Vec code, after many loops/epochs of training, isn't giving good results. What might be wrong?
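
For contrast, the anti-pattern looks roughly like this (don't do this) – a single .train() call, as in the sketch above, already manages the alpha decay across all epochs internally:

# DON'T: multiple .train() calls with hand-rolled alpha management
# for epoch in range(20):
#     model.train(train_corpus, total_examples=model.corpus_count, epochs=1)
#     model.alpha -= 0.002          # fragile manual learning-rate decay
#     model.min_alpha = model.alpha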

gojomo
  • I tried every suggestion in the discussion on GitHub, but vectors from the same model are absolutely different. My model was trained on a corpus of about 200k texts of various lengths. I tried from 10 to 300 epochs, but the result is the same. The problem is I can't use the trained Doc2Vec in my program because on every inference I get absolutely different vectors for the same text. – sergzemsk Mar 16 '22 at 07:24
  • That they are "different" is expected. The thing that matters in real corpuses & uses is: how different? For example, for typical applications, it usually doesn't matter if the vectors are different but still close, and thus have very similar lists of closest neighbors. You may want to edit your question (or post a new question) with more info about your training – corpus characteristics & code used – and specific goals, with a quantification (rather than an eyeball comparison) of how far results are from what you need. – gojomo Mar 16 '22 at 17:34