I'm trying to create an algorithm that can show the top n documents most similar to a specific document. For that I used gensim's Doc2Vec. The code is below:
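import gensim

# PV-DBOW (dm=0), with word vectors trained as well (dbow_words=1)
model = gensim.models.doc2vec.Doc2Vec(size=400, window=8, min_count=5, workers=11,
                                      dm=0, alpha=0.025, min_alpha=0.025, dbow_words=1)
model.build_vocab(train_corpus)

# 10 iterations, lowering the learning rate by hand each time
for x in xrange(10):
    model.train(train_corpus)
    model.alpha -= 0.002
    model.min_alpha = model.alpha
    model.train(train_corpus)

model.save('model_EN_BigTrain')

# Ask for the 10 documents most similar to the one tagged 408
sims = model.docvecs.most_similar([408], topn=10)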
The sims variable should give me 10 tuples, where the first element is the id of the document and the second is the similarity score. The problem is that some of the ids do not correspond to any document in my training data.
I've been trying for some time now to make sense of the ids that aren't in my training data, but I don't see any logic to them.
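The mismatch can be checked with a minimal sketch like the following. It assumes that sims is a list of (id, score) tuples and that every item in train_corpus exposes a tags list. The Doc stand-in, the helper name unknown_ids, and the sample values are hypothetical, used only so the snippet runs without a trained model.

```python
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument so the sketch runs on its own.
Doc = namedtuple("Doc", ["words", "tags"])


def unknown_ids(sims, train_corpus):
    """Return the ids from sims that were never used as a tag in train_corpus."""
    known = set()
    for doc in train_corpus:
        known.update(doc.tags)
    return [doc_id for doc_id, _score in sims if doc_id not in known]


# Made-up sample data, not real results.
sample_corpus = [Doc(["flat", "near", "beach"], [408]),
                 Doc(["house", "with", "garden"], [1200])]
sample_sims = [(1200, 0.91), (17, 0.88)]

print(unknown_ids(sample_sims, sample_corpus))
```

With the real model, the same helper would be called as unknown_ids(sims, train_corpus).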
PS: This is the code that I used to create my train_corpus:
def readData(train_corpus, jData):
    print("The response contains {0} properties".format(len(jData)))
    print("\n")
    for i in xrange(len(jData)):
        print "> Reading offers from Aux array"
        if i % 10 == 0:
            print ">>", i, "offers processed..."
        train_corpus.append(gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(jData[i][1]), tags=[jData[i][0]]))
    print "> Finished processing offers"
Each position of the aux array is itself an array in which position 0 is an int (which I want to be the id) and position 1 is a description.
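To make the input shape concrete, here is a small sketch under the description above. The sample rows are made up, and Doc and the whitespace tokenizer are hypothetical stand-ins for gensim's TaggedDocument and simple_preprocess, used only so the snippet runs without gensim installed.

```python
from collections import namedtuple

# Hypothetical stand-in for gensim.models.doc2vec.TaggedDocument.
Doc = namedtuple("Doc", ["words", "tags"])

# Made-up rows: position 0 is the int id, position 1 is the description.
jData = [
    [408, "Sunny two bedroom flat near the beach"],
    [1200, "Large family house with a garden"],
]


def to_tagged(rows):
    """Build one tagged document per row, tagged with the row's int id."""
    docs = []
    for row in rows:
        doc_id, description = row[0], row[1]
        # Simplified tokenizer (assumption); the real code uses simple_preprocess.
        words = description.lower().split()
        docs.append(Doc(words, [doc_id]))
    return docs


train_corpus = to_tagged(jData)
print([doc.tags for doc in train_corpus])
```

Each document ends up with a single int tag, which is the id I expect to get back from most_similar.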
Thanks in advance.