I'm trying to compare two sentences and get the cosine similarity between them.

I have about 50 sentences, and I used gensim's pre-trained doc2vec, training the model on these 50 sentences to tweak the weights a little. However, the cosine similarity between two sentences does not truly reflect their similarity. For example, sentence1 is not at all close in meaning to sentence2, yet their embeddings are very similar.

My question is: how do I go about comparing two sentences for similarity in general, given that doc2vec is not working for me? The problem seems to be the small amount of training data available to tweak the weights, but I wonder if there is another technique that can achieve this task.

e.g., my rough implementation so far:

s1 = "This is a sentence"
s2 = "This is also a sentence"
...
s50 ="This is the last sentence

list = [s1,s2..s50]

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[
                                      str(i)]) for i, _d in enumerate(list)]
model = Doc2Vec(vector_size=vec_size,
                        alpha=alpha,
                        min_alpha=0.00025,
                        min_count=1,
                        dm=1)

model.build_vocab(tagged_data)

for epoch in range(max_epochs):
   print('iteration {0}'.format(epoch))
   model.train(tagged_data,
   total_examples=model.corpus_count,
   epochs=100)
   # decrease the learning rate
   model.alpha -= 0.0002
   # fix the learning rate, no decay
   model.min_alpha = model.alpha

I then loop through each sentence and call model.infer_vector(sent_tokens) to get the embeddings. But as I said, the resulting similarities are not even close to correct.
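For reference, a minimal sketch of that comparison step (assuming model is the Doc2Vec model trained above, and using plain numpy for the cosine):

import numpy as np
from nltk.tokenize import word_tokenize

def cosine_sim(model, s1, s2):
    # infer a vector for each tokenized, lower-cased sentence
    v1 = model.infer_vector(word_tokenize(s1.lower()))
    v2 = model.infer_vector(word_tokenize(s2.lower()))
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(cosine_sim(model, "This is a sentence", "This is also a sentence"))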

If I am doing something wrong please let me know.

Richard Wilson

1 Answer


There is no "gensim pre-trained doc2vec", so if in fact you're using some pre-trained model from some 3rd party, you'd need to describe its source to know what's in play here. (However, your code seems to show a new model trained from scratch on only 50 sentences.)

50 sentences is not enough to train Doc2Vec (or related algorithms like Word2Vec or FastText). These models need bulk data, with many subtly-varying, realistic usage examples of every word of interest, to create useful vectors.

It is almost always a bad idea to use min_count=1 with Doc2Vec and similar algorithms, because they depend on the influence of multiple varied contexts for each word. If a word appears only once, or a few times, any vector learned for that word is likely to be idiosyncratic to those appearances and not generalizably useful. Worse, the many such rare words in typical natural-language corpora act as noise in the model, diluting and interfering with the training of other words for which there are suitable examples. The models usually work better if you discard such infrequent words entirely, which is why the default is min_count=5.

I've not seen any good write-up of someone doing tiny follow-up tuning, with a small amount of new data, on a pre-trained Doc2Vec model, so I wouldn't recommend attempting that to someone just starting out with Doc2Vec. (If it works at all, it will require expert experimentation and tuning.)

It's also almost always a misguided and error-prone idea to call .train() more than once in a loop, adjusting alpha/min_alpha yourself outside the usual default and automatic management. See this answer for more details: https://stackoverflow.com/a/62801053/130288
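A rough sketch of the usual pattern instead (vector_size and epochs here are just placeholder values, and tagged_data is the list of TaggedDocument objects from your question): specify epochs once at construction, call .train() exactly once, and let gensim manage the alpha decay internally.

from gensim.models.doc2vec import Doc2Vec

# placeholder hyperparameters; min_count is left at its default of 5,
# and the learning-rate decay is handled automatically by gensim
model = Doc2Vec(vector_size=100, epochs=40, dm=1)
model.build_vocab(tagged_data)
model.train(tagged_data,
            total_examples=model.corpus_count,
            epochs=model.epochs)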

If you train properly with a good-sized corpus, and then check pairwise similarities of texts like those the training data represents, you should see more sensible similarity values.
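For example, once trained, you can sanity-check the learned document vectors directly (model.dv in gensim 4.x; model.docvecs in older 3.x versions):

# docs most similar to the doc tagged '0'
print(model.dv.most_similar('0'))
# cosine similarity between the docs tagged '0' and '1'
print(model.dv.similarity('0', '1'))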

gojomo
  • thanks for this! Explains a lot. I did sense it wasn't pre-trained. Would you happen to know of any location of a pre-trained model? For example, with word2vec you can use Google's word2vec vectors. – Jeff Jefferson Aug 08 '21 at 01:45