I am struggling with Doc2Vec and I cannot see what I am doing wrong. I have a text file with one sentence per line. Given a new sentence, I want to find which sentence in that file is closest to it.
Here is the code for model creation:
from gensim import models
sentences = LabeledLineSentence(filename)
model = models.Doc2Vec(size=300, min_count=1, workers=4, window=5, alpha=0.025, min_alpha=0.025)
model.build_vocab(sentences)
model.train(sentences, epochs=50, total_examples=model.corpus_count)
model.save(modelName)
For test purposes, here is my file:
uduidhud duidihdd
dsfsdf sdf sddfv
dcv dfv dfvdf g fgbfgbfdgnb
i like dogs
sgfggggggggggggggggg ggfggg
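To make the labelling question concrete: if the labels are zero-based, which the SENT_0 in the result further down suggests, the five lines would map to SENT_0 through SENT_4. The snippet below is only a sketch under that assumption; `label_lines` is a hypothetical helper written for illustration, not the actual LabeledLineSentence class.

```python
# Hypothetical helper, NOT the real LabeledLineSentence class: it only shows
# how zero-based SENT_<n> labels would line up with the lines of the file.
def label_lines(lines, prefix="SENT_"):
    """Pair each non-empty line with a zero-based label such as SENT_0."""
    non_empty = [line.strip() for line in lines if line.strip()]
    return [(f"{prefix}{i}", text.split()) for i, text in enumerate(non_empty)]


# The five lines of the test file shown above.
corpus = [
    "uduidhud duidihdd",
    "dsfsdf sdf sddfv",
    "dcv dfv dfvdf g fgbfgbfdgnb",
    "i like dogs",
    "sgfggggggggggggggggg ggfggg",
]

labelled = label_lines(corpus)
for tag, words in labelled:
    print(tag, words)
```

Under this assumption, "i like dogs" (the 4th line) would carry the label SENT_3.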
And here is my test:
test = "i love dogs".split()
print(model.docvecs.most_similar([model.infer_vector(test)]))
Whatever training parameters I use, this should obviously tell me that the most similar sentence is the 4th one (SENT_3 or SENT_4; I don't know how the indexes work, but the sentence labels have this form). But here is the result:
[('SENT_0', 0.15669342875480652),
('SENT_2', 0.0008485736325383186),
('SENT_4', -0.009077289141714573)]
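For context, the second element of each pair is a cosine similarity between the inferred vector and a stored document vector, so it can range from -1 to 1. A minimal pure-Python sketch of that measure follows; the vectors are made up for illustration and are not taken from the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Illustrative vectors only (assumed values, not model output).
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

Scores close to 0, like the ones above, would mean the inferred vector is nearly orthogonal to every stored document vector.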
What am I missing? If I try with the exact sentence from the file ("i like dogs"), I get SENT_2, then SENT_1, then SENT_4... I really don't get it. And why are the similarity scores so low? Also, when I load the saved model and run the test a few times in a row, I don't get the same results either.
Thanks for your help.