
According to this GitHub tutorial (gensim/docs/notebooks/doc2vec-lee.ipynb), I am supposed to be getting about 96% accuracy.

Here is the code, using gensim 0.13.4 in a Jupyter 4.3.1 notebook, both installed via Anaconda Navigator.

import gensim
import os
import collections
import smart_open
import random


# Set file names for the training data
test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data'])
lee_train_file = test_data_dir + os.sep + 'lee_background.cor'

def read_corpus(fname, tokens_only=False):
    with smart_open.smart_open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            if tokens_only:
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(
                    gensim.utils.simple_preprocess(line), [i])

train_corpus = list(read_corpus(lee_train_file))
model = gensim.models.doc2vec.Doc2Vec(size=50, min_count=2, iter=10)
model.build_vocab(train_corpus)
model.train(train_corpus)

ranks = []
second_ranks = []
for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)
    second_ranks.append(sims[1])
collections.Counter(ranks)

In the tutorial, for the assessment of the model, their output is:

Counter({0: 292, 1: 8})  

I am getting:

Counter({0: 31,
         1: 24,
         2: 16,
         3: 19,
         4: 16,
         5: 8,
         6: 8,
         7: 10,
         8: 7,
         9: 10,
         10: 12,
         11: 12,
         12: 5,
         13: 9,
         ...

Why am I not getting anything near their accuracy?

yxs8495
  • Welcome to SO! Your question lacks basic formatting and it is not clear what you are asking. Try editing the question and show the steps that you take to solve the problem. Also, avoid referring to external links, unless completely necessary. Please read: http://stackoverflow.com/help/how-to-ask – bman Jan 13 '17 at 02:42

2 Answers


Thanks for spotting it. The accuracy and the similar documents vary a lot on such a tiny corpus due to random initialisation and different OS numerical libraries. I removed the reference to accuracy in the tutorial.

One needs a large corpus and tens of hours of training to get reproducible doc2vec results.
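If you want runs on this small corpus to be more repeatable, you can fix the random seed and restrict training to a single worker thread. This is only a minimal sketch assuming the standard Doc2Vec seed and workers parameters; full determinism also depends on things like PYTHONHASHSEED and the underlying numerical libraries:

# Minimal sketch: reduce run-to-run variation by fixing the seed and
# using one worker thread (exact reproducibility is still not guaranteed).
model = gensim.models.doc2vec.Doc2Vec(size=50, min_count=2, iter=10,
                                       seed=42, workers=1)
model.build_vocab(train_corpus)
model.train(train_corpus)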

Also answered on Gensim mailing list

Lev Konst

I appreciate the response from @Lev Konst above. As he mentioned, this is also answered on the Gensim mailing list.

model = gensim.models.doc2vec.Doc2Vec(size=55, min_count=2, iter=60, hs=1, negative=0) produced:

Wall time: 12.5 s
Counter({0: 292, 1: 8})
Wall time: 12 s
Counter({0: 291, 1: 9})
Wall time: 16.4 s
Counter({0: 290, 1: 10})
Wall time: 20.6 s
Counter({0: 295, 1: 5})
Wall time: 21.3 s
Counter({0: 292, 1: 8})
Wall time: 20.6 s
Counter({0: 292, 1: 8})
Wall time: 16.7 s
Counter({0: 296, 1: 4})
Wall time: 15.4 s
Counter({0: 292, 1: 8})
Wall time: 15.3 s
Counter({0: 295, 1: 5})
Wall time: 14.8 s
Counter({0: 292, 1: 8})

It would appear that increasing the iterations and/or adding hs=1, negative=0 will yield results closer to the notebook's.

The hs=1, negative=0 setting seems to yield better results on average, though. If one merely increases the iterations, then on some runs there will be ranks other than 0 or 1.

As one can see, with hs=1, negative=0 the rankings all fall within the top two.

However, I have been informed on the gensim Google Groups list that with a dataset of this size, less-than-optimal accuracy and more variation are to be expected.
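For reference, here is a minimal end-to-end sketch of the setup described above. It reuses train_corpus and the imports from the question, and the parameter values are simply the ones suggested in this answer, not a definitive recipe:

# Train with hierarchical softmax instead of negative sampling, and more passes.
model = gensim.models.doc2vec.Doc2Vec(size=55, min_count=2, iter=60, hs=1, negative=0)
model.build_vocab(train_corpus)
model.train(train_corpus)

# Sanity check: for each training document, see where the model ranks it
# among its own most similar documents.
ranks = []
for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)

print(collections.Counter(ranks))  # in the runs above, the ranks were all 0 or 1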

Google Groups discussion

thanks john

yxs8495