
I'm trying to create an algorithm that can show the top n documents similar to a specific document. For that I used gensim's Doc2Vec. The code is below:

model = gensim.models.doc2vec.Doc2Vec(size=400, window=8, min_count=5, workers = 11, 
dm=0,alpha = 0.025, min_alpha = 0.025, dbow_words = 1)

model.build_vocab(train_corpus)

for x in xrange(10):
    model.train(train_corpus)
    model.alpha -= 0.002
    model.min_alpha = model.alpha
    model.train(train_corpus)

model.save('model_EN_BigTrain')

sims = model.docvecs.most_similar([408], topn=10)

The sims variable should give me 10 tuples, where the first element is the id of the doc and the second is the score. The problem is that some ids do not correspond to any document in my training data.

I've been trying for some time now to make sense of the ids that aren't in my training data, but I don't see any logic.

PS: This is the code that I used to create my train_corpus:

def readData(train_corpus, jData):

    print("The response contains {0} properties".format(len(jData)))
    print("\n")
    for i in xrange(len(jData)):
        print "> Reading offers from Aux array"
        if i % 10 == 0:
            print ">>", i, "offers processed..."

        # jData[i][0] is the int id used as the tag, jData[i][1] is the description
        train_corpus.append(gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(jData[i][1]), tags=[jData[i][0]]))
    print "> Finished processing offers"

Each position of the aux array is itself an array in which position 0 is an int (that I want to be the id) and position 1 is a description.

Thanks in advance.

JoaoSilva

2 Answers


Are you using plain integer IDs as your tags, but not using exactly all of the integers from 0 to whatever your MAX_DOC_ID is?

If so, that could explain the appearance of tags within that range. When you use plain ints, gensim Doc2Vec avoids creating a dict mapping provided tags to index-positions in its internal vector-array – and just uses the ints themselves.

Thus that internal vector-array must be allocated to include MAX_DOC_ID + 1 rows. Any rows corresponding to unused IDs are still initialized as random vectors, like all the positions, but won't receive any of the training from actual text examples to push them into meaningful relative positions. It's thus possible these random-initialized-but-untrained vectors could appear in later most_similar() results.
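As a rough illustration of that layout, here is a toy model in plain Python. It is not gensim code: the function name `describe_int_tag_layout` and the sample tags are made up for this sketch, and it only mirrors the "MAX_DOC_ID + 1 rows" rule described above.

```python
def describe_int_tag_layout(tags):
    """Toy model of the row layout for plain-int tags (hypothetical helper, not gensim)."""
    rows = max(tags) + 1  # one row for every integer from 0 up to the largest tag
    used = set(tags)
    # Rows whose index was never used as a tag: initialized but never trained.
    untrained = [i for i in range(rows) if i not in used]
    return rows, untrained


rows, untrained = describe_int_tag_layout([408, 3, 17])
print(rows)            # 409
print(len(untrained))  # 406
```

With only three real documents, 406 of the 409 rows would be the random, untrained kind that can leak into similarity results.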

To avoid that, you have three options: use only contiguous ints from 0 to the last ID you need; or, if you can afford the memory cost of the string-to-index mapping, use string tags instead of plain ints; or keep an extra record of the valid IDs and manually filter the unwanted IDs from the results.
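Two of those options can be sketched in plain Python. The helper names `filter_known` and `contiguous_ids`, and the sample `sims` list with its scores, are invented for this example; `sims` only imitates the (id, score) tuples that most_similar() returns.

```python
def filter_known(sims, valid_ids, topn=10):
    """Keep only (doc_id, score) pairs whose id was really in the training data."""
    return [(doc_id, score) for doc_id, score in sims if doc_id in valid_ids][:topn]


def contiguous_ids(original_ids):
    """Map unique, arbitrary ids to 0..n-1 and return both directions of the mapping."""
    to_index = {orig: i for i, orig in enumerate(original_ids)}
    to_original = {i: orig for orig, i in to_index.items()}
    return to_index, to_original


valid_ids = {3, 17, 408}
sims = [(17, 0.91), (250, 0.88), (3, 0.80), (99, 0.75)]  # made-up results
print(filter_known(sims, valid_ids, topn=2))  # [(17, 0.91), (3, 0.8)]

to_index, to_original = contiguous_ids([3, 17, 408])
print(to_index[408], to_original[2])  # 2 408
```

When filtering, it probably makes sense to ask most_similar() for more than the number of results you need, since some of them will be discarded.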

Separately: by not specifying iter=1 in your Doc2Vec model initialization, the default of iter=5 will be in effect, meaning each call to train() does 5 iterations over your data. Oddly, also, your xrange(10) for-loop includes two separate calls to train() each iteration (and the 1st is just using whatever alpha/min_alpha was already in place). So you're actually doing 10 * 2 * 5 = 100 passes over the data, with an odd learning-rate schedule.

Instead, if you want 10 passes, I suggest you just set iter=10, leave the default alpha/min_alpha untouched, and call train() only once. The model will do 10 passes, smoothly managing alpha from its starting to its ending value.
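The pass count and learning-rate schedule of the loop in the question can be checked with a small simulation. This is only a sketch in plain Python: it does not call gensim, and it assumes each train() call runs ITER passes at whatever alpha is current (alpha and min_alpha are equal in the question's code, so the rate is constant within a call).

```python
ITER = 5       # default passes per train() call, as stated in the answer
alpha = 0.025  # starting alpha from the question's model setup

alphas_used = []
for _ in range(10):
    alphas_used.append(alpha)  # first train(): alpha left over from the previous round
    alpha -= 0.002
    alphas_used.append(alpha)  # second train(): the freshly lowered alpha

total_passes = len(alphas_used) * ITER
print(total_passes)               # 100
print(round(alphas_used[-1], 3))  # 0.005
```

The result is a staircase of 20 constant-rate train() calls rather than one smooth decay over 10 passes.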

gojomo
  • Hi gojomo, first, thanks for the fast response. I noticed what you explained here: yes, if the first id I had was 408, the vector had 409 positions. To solve that, I used the ids from the database as strings, and my problem was solved. As for the Doc2Vec model, I had specified iter to be 1 but forgot to change that in this code. I saw this loop in a doc2vec tutorial, where it was said to give better results. Do you have any opinion on that? Is it beneficial to do the loop, or to train the model with iter as 10 or 0? – JoaoSilva Mar 28 '17 at 10:04
  • Simply doing the looping yourself, with less-smooth learning-rate decay, is unlikely to offer better results. Mistakenly doing 100 iterations when you think you're only doing 10 might create perceived improvement – if you're mistakenly comparing against `iter=10` (and disregarding the extra time used). Calling `train()` multiple times with explicit alpha/min_alpha tinkering is usually an error-prone complication, & in any case the two `train()`s per iteration seems like a stray mis-edit. So I'd avoid that pattern. – gojomo Mar 28 '17 at 10:21
  • Note edit clarification above, that with explicit `iter=10`, the default `alpha`/`min_alpha` can also be left in place. – gojomo Mar 28 '17 at 10:23
  • Thanks a lot gojomo!! You were a great help. – JoaoSilva Mar 28 '17 at 10:25

I was having this problem as well. I was building the documents for my doc2vec model with the following:

for idx,doc in data.iterrows():
    alldocs.append(TruthDocument(doc['clean_text'], [idx], doc['label']))

I was passing it a dataframe that had some wonky (non-contiguous) indexes. All I had to do was reset the index before building the documents:

df.reset_index(inplace=True)
Stephen Rauch