
I'm having trouble with the most_similar method in Gensim's Doc2Vec model. When I run most_similar, I only get the similarity of the first 10 tagged documents (based on their tags, which are always 0-9). In this code I have topn=5, but I've also tried topn=len(documents) and I still only get similarities for the first 10 documents.

Tagged documents:

from nltk.tokenize import RegexpTokenizer
import gensim
from gensim.models.doc2vec import TaggedDocument

tokenizer = RegexpTokenizer(r'\w+')
taggeddoc = []

for index, doc in enumerate(model_data):
    tokens = tokenizer.tokenize(doc)
    td = TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(tokens))).split(), str(index))
    taggeddoc.append(td)

documents=taggeddoc

Instantiate the model:

model=gensim.models.Doc2Vec(documents, dm=0, dbow_words=1, iter=1, alpha=0.025, min_alpha=0.025, min_count=10)

Train the model:

for epoch in range(100):
    if epoch % 10 == 0:
        print("Training epoch {}".format(epoch))
    model.train(documents, total_examples=model.corpus_count, epochs=model.iter)
    model.alpha -= 0.002
    model.min_alpha = model.alpha

Problem is here (I think):

new = model_data[100].split()
new_vector = model.infer_vector(new)
sims = model.docvecs.most_similar([new_vector], topn=5)
print(sims)

Output:

[('3', 0.3732905089855194), ('1', 0.36121609807014465), ('7', 0.35790640115737915), ('9', 0.3569292724132538), ('2', 0.3521473705768585)]

The length of documents is the same before and after training the model. I'm not sure why it's only returning similarities for the first 10 documents.

Side question: In anyone's experience, is it better to use Word2Vec or Doc2Vec if the input documents are very short (~50 words) and there are >2,000 documents? Thanks for the help!

J. Collins

1 Answer


The second argument to TaggedDocument(), tags, should be a list-of-tags, not a single string.

By supplying single strings of simple integers like '109', that's being interpreted as the list-of-tags ['1', '0', '9'] - and thus across your whole corpus, only 10 unique tags, the digits 0-9, will be encountered/trained.

Make it a single-tag list, like [str(index)], and you'll get results more like what you expect.
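
The confusion comes from the fact that a Python string is itself an iterable of characters, so it quietly satisfies gensim's expectation of a list-of-tags. A minimal sketch of the difference (plain Python, no gensim needed):

```python
# A string passed as tags iterates as individual characters,
# so every document's tag collapses into the digits 0-9.
tags_wrong = str(109)     # '109' -> iterates as '1', '0', '9'
tags_right = [str(109)]   # ['109'] -> one distinct tag per document

print(list(tags_wrong))   # ['1', '0', '9']
print(list(tags_right))   # ['109']
```

That is why exactly 10 tags show up in most_similar no matter how many documents were trained.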

Regarding your side question, both Word2Vec and Doc2Vec work best on large corpuses with millions of words in the training data. A mere 2,000 documents * at most 50 words each, giving at most 100,000 training-words, is very very small for these algorithms. You might be able to eke out some slight results by using a much-smaller size model and many-more training iter passes, but that's not the kind of dataset/problem on which these algorithms work well.

Separately, your training code is totally wrong.

  • If you supply documents to the Doc2Vec initialization, it will do all of its needed vocabulary-discovery and iter training passes automatically – don't call train() any more.

  • And if for some reason you don't provide documents at initialization, you should typically then call both build_vocab() and train() each exactly once.

  • Almost no-one should be changing min_alpha or calling train() more than once in an explicit loop: you are almost certain to do it wrong, as here, where you'll decrement the effective alpha from 0.025 by 0.002 over 100 loops, winding up with a nonsensical negative learning rate of -0.175. Don't do this, and if you copied this approach from what seemed to be a credible online source, please let that source know their code is confused.

gojomo
  • Thanks @gojomo. Do you recommend a model/library for a corpus my size? – J. Collins Feb 18 '18 at 16:32
  • I think mostly the code in the question follows the example here https://rare-technologies.com/doc2vec-tutorial/ which is a bit confusing. – athlonshi Sep 10 '19 at 14:28
  • See the `IMPORTANT NOTE` at the top of that old blog post – there's been a lot of library changes, and improved experience, since that was written. The example notebooks bundled with gensim, including the one linked, are better examples of recommended practice. – gojomo Sep 10 '19 at 16:06