When I use model.infer_vector to compute the vectors, a different order of the documents gives different results.

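size = 200
negative = 15
min_count = 1
iterNum = 20
windows = 5
# Build the model file name on a single line; the original line break was a syntax error.
modelName = "datasets/dm-sum.bin_" + str(windows) + "_" + str(size) + "_" + str(negative)
model = loadDoc2vecModel(modelName)
vecNum = 200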

Calling infer_vector:

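import codecs

# Each line of the test file becomes a list of whitespace-separated tokens.
test_docs = [x.strip().split() for x in
             codecs.open("datasets/test_keyword_f1", "r", "utf-8").readlines()]
for item in test_docs:
    # The original printed resStr, which is not defined in this excerpt;
    # printing the joined tokens here is an assumption about what it held.
    print("%s" % (" ".join(item)))
    vecTmp = model.infer_vector(item, alpha=0.05, steps=20)
    print(vecTmp)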

When I ran the infer_vector code twice, the results were as follows.

I don't know why this happened.

The results are at this link.

eli Yi

1 Answer

The Doc2Vec training/inference algorithm (in most modes) includes elements of randomization, so you won't necessarily get identical results from repeated runs unless you add specific extra constraints to force determinism.

Instead, with a strong model and sufficient training/inference (more steps), you should get very-similar-quality vectors on repeated runs.
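One way to check how close repeated runs are is to compare the resulting vectors by cosine similarity rather than by exact equality. The following is only a minimal sketch in plain Python: the helper names `cosine_similarity` and `min_pairwise_similarity` are made up for illustration and are not part of gensim.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length sequences of floats."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def min_pairwise_similarity(vectors):
    """Lowest cosine similarity among all pairs of vectors (hypothetical helper).

    `vectors` is meant to hold the results of inferring the same document
    several times; a value near 1.0 means the runs agree closely.
    """
    if len(vectors) < 2:
        raise ValueError("need at least two vectors to compare")
    return min(cosine_similarity(u, v) for u, v in combinations(vectors, 2))
```

With the model from the question, `vectors` could be built as `[model.infer_vector(item, alpha=0.05, steps=20) for _ in range(5)]` for a single `item`. What counts as similar enough depends on your use; if the minimum is low, that suggests trying more steps or a stronger model.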

More steps may be especially important for short texts. It's hard to tell in your screenshot – it'd be better if you pasted the text into your question – but it looks like the space-delimitation in your text results in documents of 13-17 tokens each.

Also, if a model was initially trained on very different kinds of texts, or very little data compared to its overall size (in dimensions/vocabulary), it may not have much generalizable capability for inferring new vectors on new texts. That sort of model-weakness also tends to make the vectors from repeated runs less similar to each other.

(I don't recommend trying to force determinism from run to run. To do so is essentially to cover up the inherent randomness/instability in your setup. It's better to recognize it and adopt tangible measures to tighten the outputs, like a stronger model or more iterations. But if you want to try anyway, there's a discussion of the ways to do so in this gensim issue.)

gojomo