Aim- The user inputs a string. I need to compare this input against Sentence 1 and Sentence 2 and decide which of the two it is most similar to.
Current Approach- I tokenize the input and both sentences, look up the WordNet synsets of each token, and score token pairs with NLTK's synset1.path_similarity(synset2) (path_similarity is a Synset method, so I compare synsets of the two tokens). I take the best similarity for each token and sum these up to get a score for each sentence, roughly as in the sketch below.
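A minimal sketch of what I'm doing (function and variable names are just illustrative, not my exact code):

```python
# Rough sketch of the current scoring. Requires the NLTK 'punkt' and
# 'wordnet' data packages.
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize

def best_token_similarity(t1, t2):
    """Highest path_similarity over all synset pairs of the two tokens."""
    best = 0.0
    for s1 in wn.synsets(t1):
        for s2 in wn.synsets(t2):
            sim = s1.path_similarity(s2)
            if sim is not None and sim > best:
                best = sim
    return best

def sentence_score(input_text, sentence_text):
    """Sum, over the sentence's tokens, of each token's best match to the input."""
    input_tokens = word_tokenize(input_text.lower())
    sent_tokens = word_tokenize(sentence_text.lower())
    # Summing means a longer sentence contributes more terms to the total.
    return sum(
        max((best_token_similarity(s, i) for i in input_tokens), default=0.0)
        for s in sent_tokens
    )
```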
Problem- Because I sum the individual token similarities, a long sentence with many tokens (Sentence 2) always ends up with a higher total than a short one (Sentence 1), even when most of the input's tokens actually match Sentence 1.
One solution- I can divide each sentence's total similarity by its length (number of tokens), which gives an average similarity per token (sketched below). But this approach feels too aggressive. Is there an industry-standard approach for this?
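For reference, the normalization I mean is just this (again an illustrative sketch, building on the functions above):

```python
def normalized_score(input_text, sentence_text):
    """Average per-token similarity: raw sum divided by sentence length."""
    sent_tokens = word_tokenize(sentence_text.lower())
    if not sent_tokens:
        return 0.0
    return sentence_score(input_text, sentence_text) / len(sent_tokens)
```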