I'm having trouble with the `most_similar` method of Gensim's Doc2Vec model. When I run `most_similar`, I only get similarities for the first 10 tagged documents (based on their tags — always 0 through 9). The code below uses `topn=5`, but I've also tried `topn=len(documents)` and I still only get similarities for the first 10 documents.
Tagged documents:

```python
import gensim
from gensim.models.doc2vec import TaggedDocument
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
taggeddoc = []
for index, doc in enumerate(model_data):
    tokens = tokenizer.tokenize(doc)
    td = TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(tokens))).split(), str(index))
    taggeddoc.append(td)
documents = taggeddoc
```
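To rule out the data-prep step, here is a minimal, self-contained version of the tagging loop that runs without gensim (a `namedtuple` stand-in for `TaggedDocument`, dummy strings in place of `model_data`). One thing I noticed while checking it: `tags` is stored exactly as passed, i.e. as a plain string rather than a list.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument (which is also a words/tags namedtuple)
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

# Dummy data in place of model_data
sample_data = ["doc number {}".format(i) for i in range(12)]
tagged = [TaggedDocument(doc.split(), str(i)) for i, doc in enumerate(sample_data)]

print(tagged[11].tags)        # tags stored exactly as passed: the string '11'
print(list(tagged[11].tags))  # iterating a string yields characters: ['1', '1']
```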
Instantiate the model:

```python
model = gensim.models.Doc2Vec(documents, dm=0, dbow_words=1, iter=1,
                              alpha=0.025, min_alpha=0.025, min_count=10)
```
Train the model:

```python
for epoch in range(100):
    if epoch % 10 == 0:
        print("Training epoch {}".format(epoch))
    model.train(documents, total_examples=model.corpus_count, epochs=model.iter)
    model.alpha -= 0.002
    model.min_alpha = model.alpha
```
The problem shows up here (I think):

```python
new = model_data[100].split()
new_vector = model.infer_vector(new)
sims = model.docvecs.most_similar([new_vector], topn=5)
print(sims)
```
Output:

```python
[('3', 0.3732905089855194), ('1', 0.36121609807014465), ('7', 0.35790640115737915), ('9', 0.3569292724132538), ('2', 0.3521473705768585)]
```
The length of `documents` is the same before and after training, so I'm not sure why `most_similar` only returns similarities for the first 10 documents.
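To narrow it down further, I counted the distinct tag entries the way I understand gensim's vocabulary scan treats the `tags` field (by iterating over each value). A stdlib-only sketch, with dummy tags built the same way as in my loop above:

```python
# 2,000 tags built the same way as in my tagging loop: str(index)
tags_as_built = [str(i) for i in range(2000)]

# Iterate each tags value the way I understand the vocabulary scan does;
# a plain string iterates character by character
distinct = set()
for tags in tags_as_built:
    for tag in tags:
        distinct.add(tag)

print(sorted(distinct))  # only the digits '0'..'9'
print(len(distinct))     # 10
```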
Side question: in your experience, is Word2Vec or Doc2Vec better suited when the input documents are very short (~50 words) and there are more than 2,000 of them? Thanks for the help!