
Aim: A user inputs a string. I need to compare this input with Sentence 1 and Sentence 2 and find which of the two it is more similar to.

Current Approach: I tokenize the input and both sentences, look up the WordNet synonym sets (synsets) of each token, and score each sentence by summing per-token similarities computed with NLTK's path_similarity(synset1, synset2) (sketched below). Note that path_similarity operates on synsets, not raw tokens.
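A minimal sketch of this scoring, assuming NLTK with the WordNet corpus downloaded; the function name and the pair-wise summation are one reading of the description above, not the exact original code:

    from nltk.tokenize import word_tokenize
    from nltk.corpus import wordnet as wn

    def sentence_score(input_sentence, candidate):
        """Sum, over every (input token, candidate token) pair, the best
        path similarity between any of their WordNet synsets."""
        total = 0.0
        for tok_in in word_tokenize(input_sentence.lower()):
            for tok_cand in word_tokenize(candidate.lower()):
                best = 0.0
                for s1 in wn.synsets(tok_in):
                    for s2 in wn.synsets(tok_cand):
                        sim = s1.path_similarity(s2)  # None for incompatible POS
                        if sim is not None and sim > best:
                            best = sim
                total += best  # a longer candidate accumulates more terms
        return total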

Problem: If Sentence 1 is short and Sentence 2 is long with many tokens, then because I sum the individual similarities, Sentence 2 always scores higher against the input, even when most of the input's tokens match Sentence 1.

One solution: I can divide each sentence's score by its token count, giving a per-token similarity (sketched below). But this normalization feels too aggressive. Is there an industry-standard approach for this?
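The normalization I have in mind, reusing sentence_score from the sketch above; dividing by the candidate's token count is my interpretation of "length of sentence":

    def normalized_score(input_sentence, candidate):
        """Average the summed score over the candidate's token count,
        so a long sentence can no longer win on size alone."""
        tokens = word_tokenize(candidate)
        return sentence_score(input_sentence, candidate) / len(tokens) if tokens else 0.0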

  • Read more about cosine similarity; it is one of the traditional measures of similarity between two documents. Also, a simple search for "cosine similarity with synonyms" will give you good papers to read. – Mohamed Gad-Elrab Jun 13 '16 at 18:00
  • Do you mean that I can apply cosine similarity on top of the individual similarity values of each input-string token against the sentences, or do you mean not to use path_similarity at all and simply use cosine similarity? – naves Jun 14 '16 at 20:10
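For reference, a minimal sketch of the cosine-similarity route suggested in the first comment, assuming scikit-learn (not used in the question) and plain TF-IDF vectors with no synonym expansion:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def best_match(user_input, sentences):
        """Return the candidate sentence whose TF-IDF vector is closest
        to the input, plus its cosine score."""
        vectorizer = TfidfVectorizer()
        # Fit on input + candidates together so they share one vocabulary.
        matrix = vectorizer.fit_transform([user_input] + sentences)
        scores = cosine_similarity(matrix[0:1], matrix[1:]).flatten()
        return sentences[scores.argmax()], scores.max()

Because cosine similarity compares vectors by angle rather than magnitude, it is largely insensitive to sentence length, which addresses the bias described in the question.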

0 Answers