Inaccurate similarities results by doc2vec using gensim library

Question

I am working with Gensim library to train some data files using doc2vec, while trying to test the similarity of one of the files using the method model.docvecs.most_similar("file") , I always get all the results above 91% with almost no difference between them (which is not logic), because the files do not have similarities between them. so the results are inaccurate.

Here is the code for training the model

model = gensim.models.Doc2Vec(vector_size=300, min_count=0, alpha=0.025, min_alpha=0.00025,dm=1)
model.build_vocab(it)
for epoch in range(100):
    model.train(it,epochs=model.iter, total_examples=model.corpus_count)
    model.alpha -= 0.0002
    model.min_alpha = model.alpha
model.save('doc2vecs.model')
model_d2v = gensim.models.doc2vec.Doc2Vec.load('doc2vecs.model')
sim = model_d2v.docvecs.most_similar('file1.txt')
print sim

**this is the output result**

[('file2.txt', 0.9279470443725586), ('file6.txt', 0.9258157014846802), ('file3.txt', 0.92499840259552), ('file5.txt', 0.9209873676300049), ('file4.txt', 0.9180108308792114), ('file7.txt', 0.9141069650650024)]

what am I doing wrong ? how could I improve the accuracy of results ?

score 4 · Accepted Answer · answered Sep 12 '18 at 17:23

What is your it data, and how is it prepared? (For example, what does print(iter(it).next()) do, especially if you call it twice in a row?)

By calling train() 100 times, and also retaining the default model.iter of 5, you're actually making 500 passes over the data. And the first 5 passes will use train()s internal, effective alpha-management to lower the learning rate gradually to your declared min_alpha value. Then your next 495 passes will be at your own clumsily-managed alpha rates, first back up near 0.025 and then lower each batch-of-5 until you reach 0.005.

None of that is a good idea. You can just call train() once, passing it your desired number of epochs. A typical number of epochs in published work is 10-20. (A bit more might help with a small dataset, but if you think you need hundreds, something else is probably wrong with the data or setup.)

If it's a small amount of data, you won't get very interesting Word2Vec/Doc2Vec results, as these algorithms depend on lots of varied examples. Published results tend to use training sets with tens-of-thousands to millions of documents, and each document at least dozens, but preferably hundreds, of words long. With tinier datasets, sometimes you can squeeze out adequate results by using more training passes, and smaller vectors. Also using the simpler PV-DBOW mode (dm=0) may help with smaller corpuses/documents.

The values reported by most_similar() are not similarity "percentages". They're cosine-similarity values, from -1.0 to 1.0, and their absolute values are less important than the relative ranks of different results. So it shouldn't matter if there are a lot of results with >0.9 similarities – as long as those documents are more like the query document than those lower in the rankings.

Looking at the individual documents suggested as most-similar is thus the real test. If they seem like nonsense, it's likely there are problems with your data or its preparation, or training parameters.

For datasets with sufficient, real natural-language text, it's typical for higher min_count values to give better results. Real text tends to have lots of low-frequency words that don't imply strong things without many more examples, and thus keeping them during training serves as noise making the model less strong.

score 0 · Answer 2 · answered Sep 12 '18 at 01:56

0

Without knowing the contents of the documents, here are two hints that might help you.

Firstly, 100 epochs will probably be too small for the model to learn the differences.
also, check the contents of the documents vs the corpus you are using. Make sure that the vocab is relevant for your files?

answered Sep 12 '18 at 01:56

Simas Joneliunas

2,890
20
28
35

Thanks for your for your answer. The files contain some articles copied from Wikipedia just for test ( some about sports, politics, programming). But I am supposed to use the algorithm on comments in a blog. - I tried to follow your advice and use 3000 epochs, this time the result are between 30% and 98% but still inaccurate ( no similarities between sports articles for example). -for the corpus (if I get you right) I used stopwords imported nltk.corpus but I am not sure how to choose the right vocab for a specific problem. I would appreciate any help/tips you could provide :) – Abdessamad139 Sep 12 '18 at 03:02

Inaccurate similarities results by doc2vec using gensim library

2 Answers2