Incorrect answer from doc2vec

Question

I ran a doc2vec model for text similarity with the code below, but I do not get the expected result.

it = LabeledLineSentence(datafiles, labels1)
    
model = gensim.models.Doc2Vec(vector_size=300, min_count=0, alpha=0.025, min_alpha=0.025)
model.build_vocab(it)
    
#training of model

for epoch in range(100):
    print ('iteration '+str(epoch+1))
    model.train(it,total_examples=model.corpus_count,
                epochs=model.epochs)
    
    model.alpha -= 0.002
    model.min_alpha = model.alpha
    
#saving the created model
model.save('doc2vec.model')
print ("model saved")
    
#loading the model
d2v_model = gensim.models.doc2vec.Doc2Vec.load('doc2vec.model')
    
#start testing
seed_text = "consider illegal immoral plagiarism do various"
tokens1 = seed_text.lower().split()
vector1 = d2v_model.infer_vector(tokens1)
    
#to get most similar document with similarity scores using document-index
most_similar = d2v_model.docvecs.most_similar(positive = [vector1] )
    
# output_sentences(most_similar)
print(u'%s %s: %s\n' % ("Most", most_similar[0][1], data[int(most_similar[0][0])]))

It outputs:

Most 0.14691241085529327: M

Why does it print only M instead of the data? What does that mean, and what can I do to solve the problem? Regards

score 0 · Answer 1 · answered May 26 '21 at 14:53

You're using a version of LabeledLineSentence that doesn't match the code that used to be in Gensim. Your version is taking an extra labels1 argument. So, it's non-standard and you should show its code or explain what online example you're basing your code on. Similarly, it's not clear what the values, or indirect contents, of datafiles and labels1 might be.

The M in the output is the result of your code data[int(most_similar[0][0])]. Your code doesn't show what data is, but perhaps it's a string, and the character M is in whatever position int(most_similar[0][0]) evaluates to.
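As a minimal illustration of that guess (the values below are hypothetical, since the question never shows what data holds): indexing a Python string returns a single character, while indexing a list of documents returns a whole document.

```python
# Hypothetical stand-in values; the question does not show what `data` holds.
data_as_string = "Many students consider plagiarism immoral"
data_as_list = ["first training document", "second training document"]

index = 0  # stand-in for int(most_similar[0][0])

# Indexing a string yields one character.
char = data_as_string[index]
# Indexing a list of documents yields a whole document.
doc = data_as_list[index]

print(repr(char))
print(repr(doc))
```

If data turns out to be one long string, keeping the documents in a list (or a dict keyed by tag) is probably what the print statement was meant to index into.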

(The value of most_similar[0][0] should be the document-tag that's most-similar to your inferred text-vector, which might be an int or string, depending on how you prepared your training data, in the unshown LabeledLineSentence code. There must have been a document in the training set with that as a tag.)
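One way to make that lookup robust, sketched under assumptions: keep a dictionary from each document tag to its text, and look up the returned tag directly instead of converting it with int(). The names docs_by_tag and the hard-coded most_similar list below are made-up stand-ins for the real training data and results.

```python
# Hypothetical stand-ins: tag -> original text, as used when building the training corpus.
docs_by_tag = {
    "doc_0": "plagiarism is widely considered immoral",
    "doc_1": "the weather was pleasant today",
}

# Shape of a most_similar() result: a list of (tag, similarity) pairs, best first.
most_similar = [("doc_0", 0.1469), ("doc_1", 0.0312)]

best_tag, best_score = most_similar[0]
best_text = docs_by_tag[best_tag]  # works whether tags are ints or strings
print("Most %s: %s" % (best_score, best_text))
```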

The number 0.14691241085529327 is the amount of similarity. That's not very much, so the text you inferred a vector for isn't very similar to any training document. (Perhaps that's indicative of some other problem.)
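For intuition about the scale: the scores from most_similar() are cosine similarities, which range from -1.0 to 1.0. A small pure-Python sketch with toy vectors (not real Doc2Vec output) shows the two reference points:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors: same direction vs. perpendicular.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

A score near 0.15 sits much closer to the perpendicular case than to the same-direction case.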

Your code also shows a few bad practices:

calling train() more than once & using a non-default min_alpha that you're manipulating yourself - see the answer to "My Doc2Vec code, after many loops of training, isn't giving good results. What might be wrong?" for more details
setting min_count=0 - this is almost always a bad idea, because Doc2Vec & similar algorithms benefit from ignoring rare words
using a large vector_size=300 - this would only be appropriate with some very large training corpus, the kind you'd most likely use a much-larger-than-default min_count on, and something to attempt only after gaining success with smaller experiments
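To see why the manual alpha handling in the first point goes wrong, the arithmetic of the question's loop can be replayed without Gensim. This sketch only tracks the value model.alpha would hold at each call to train(), starting at 0.025 and dropping by 0.002 per loop as in the question; it does not train anything.

```python
# Replay only the learning-rate bookkeeping from the question's loop (no Gensim needed).
alpha = 0.025
first_negative_loop = None
negative_loops = 0

for loop in range(100):
    # This is the alpha that train() would be called with on this loop.
    if alpha < 0:
        negative_loops += 1
        if first_negative_loop is None:
            first_negative_loop = loop
    alpha -= 0.002

print("first loop with negative alpha:", first_negative_loop)
print("loops trained with negative alpha:", negative_loops)
print("final alpha: %.3f" % alpha)
```

Most of the 100 loops therefore run with a negative learning rate, which in gradient-based training means each update moves the model in the wrong direction. A single train() call with the default alpha and min_alpha avoids this bookkeeping entirely.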

I suggest you not trust or use whatever online article motivated this code, & instead start from examples inside the Gensim docs, gradually building them towards your need.

Other generic good steps:

enable logging at the INFO level, watch the output: it may hint at steps that aren't behaving (via progress counts or timing) the way they should
double-check your inputs, especially the it you've created.

For example, if you run:

first_item = next(iter(it))
print('tags: %s\nwords: %s' % (first_item.tags, first_item.words))

Does it print the 1st document you intended to use as training material, with the right words and tags? If not, you've got problems in your data source.
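The logging suggestion above can be set up with Python's standard logging module before building the vocabulary or training. This is a minimal sketch; the format string is just one reasonable choice, not something the original code used.

```python
import logging

# Show timestamps and levels so progress and timing are easy to follow.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)

# Gensim's modules log under names that start with "gensim",
# so this level applies to all of them.
gensim_logger = logging.getLogger("gensim")
gensim_logger.setLevel(logging.INFO)
```

With this in place, build_vocab() and train() should report vocabulary sizes and training progress, which makes an empty or tiny corpus easy to spot.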
