I'm trying to compare two sentences and get the cosine similarity between them.
I have about 50 sentences, and I used gensim's pre-trained Doc2Vec and trained the model on these 50 sentences to tweak the weights a little. However, the cosine similarity between two sentences does not truly reflect their similarity. For example, sentence1 is not, in plain English, close to sentence2, yet their embeddings are very similar.
My question is: how do I generally go about comparing two sentences for similarity, since Doc2Vec is not working for me? It seems to be due to the small amount of training data used to tweak the weights, but I wonder if there is another technique to achieve this task.
E.g., my rough implementation so far:
s1 = "This is a sentence"
s2 = "This is also a sentence"
...
s50 ="This is the last sentence
list = [s1,s2..s50]
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[
str(i)]) for i, _d in enumerate(list)]
model = Doc2Vec(vector_size=vec_size,
alpha=alpha,
min_alpha=0.00025,
min_count=1,
dm=1)
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
print('iteration {0}'.format(epoch))
model.train(tagged_data,
total_examples=model.corpus_count,
epochs=100)
# decrease the learning rate
model.alpha -= 0.0002
# fix the learning rate, no decay
model.min_alpha = model.alpha
I then loop through each sentence and call model.infer_vector(sent_tokens) to get the embeddings. But as I said, the resulting similarities are not even close to what I would expect.
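For reference, the comparison step looks roughly like this (I'm using scikit-learn's cosine_similarity here just for illustration; any cosine implementation would do):

from sklearn.metrics.pairwise import cosine_similarity
from nltk.tokenize import word_tokenize

# infer a vector for each sentence with the trained model
vectors = [model.infer_vector(word_tokenize(s.lower())) for s in sentences]

# cosine similarity between, e.g., the first two sentences
sim = cosine_similarity([vectors[0]], [vectors[1]])[0][0]
print(sim)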
If I am doing something wrong, please let me know.