
Levenshtein distance is a way of measuring the difference between words, but it is not well suited to phrases.

Is there a good distance metric for measuring differences between phrases?

For example, suppose phrase 1 consists of n words x_1 x_2 … x_n and phrase 2 consists of m words y_1 y_2 … y_m. I'd think the two phrases should first be fuzzily aligned word by word, then each pair of aligned words should get a score for how similar they are, and some kind of gap penalty should be applied for unaligned words. These positive and negative scores should then be aggregated in some way. There seem to be some heuristics involved.

Is there an existing solution for measuring the similarity between phrases? Python is preferred, but solutions in other languages are fine too. Thanks.

user1424739

2 Answers


You can also measure the similarity between two phrases using Levenshtein distance, treating each word as a single element. For alignments that allow gaps (e.g. phrases of different lengths) you can use the Smith-Waterman algorithm (local alignment) or the Needleman-Wunsch algorithm (global alignment). Those algorithms are widely used in bioinformatics, and implementations can be found in the biopython package.

Alternatively, you can tokenize the phrases and count the frequency of each token, which gives you a frequency vector per phrase. From those vectors you can compute pairwise similarity with any vector measure, such as Euclidean distance or cosine similarity. The tokenization can be done with the nltk package, and the distances can be measured with scipy. Sketches of both ideas follow. Hope it helps.
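For illustration, here is a minimal sketch of the first idea: the usual Levenshtein dynamic program, but with words as the edit units instead of characters. The function name and unit costs are my own choices, not from any particular library.

def word_levenshtein(phrase1, phrase2):
    # Split the phrases into word sequences; words are the edit units.
    a, b = phrase1.split(), phrase2.split()
    # dp[i][j] = edit distance between the first i words of a
    # and the first j words of b.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete a word
                           dp[i][j - 1] + 1,        # insert a word
                           dp[i - 1][j - 1] + cost) # substitute a word
    return dp[len(a)][len(b)]

>>> word_levenshtein("this is a sentence", "this is another sentence")
1

And a sketch of the bag-of-words route with nltk and scipy. Note that scipy's cosine() returns a distance, so similarity is one minus that; the helper name here is mine.

from collections import Counter
from nltk import word_tokenize          # requires nltk.download('punkt') once
from scipy.spatial.distance import cosine

def phrase_cosine_similarity(phrase1, phrase2):
    # Count token frequencies in each phrase.
    c1, c2 = Counter(word_tokenize(phrase1)), Counter(word_tokenize(phrase2))
    # Build frequency vectors over the combined vocabulary.
    vocab = sorted(set(c1) | set(c2))
    v1 = [c1[w] for w in vocab]
    v2 = [c2[w] for w in vocab]
    # cosine() is a distance in [0, 1] for these nonnegative vectors.
    return 1 - cosine(v1, v2)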

TavoGLC

Take a look at FuzzyWuzzy:

>>> from fuzzywuzzy import fuzz

>>> s1 = "this is a sentence used for testing"
>>> s2 = "while this is another sentence also used for testing"
>>> s3 = "I am a completely unrelated string"

>>> fuzz.partial_ratio(s1, s2)
80
>>> fuzz.partial_ratio(s1, s3)
52
>>> fuzz.partial_ratio(s2, s3)
43

It also includes other modes of comparison that account for out-of-order tokens, etc.
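For instance, `token_sort_ratio` sorts the tokens before comparing, so word order stops mattering, and `token_set_ratio` additionally tolerates extra tokens when one phrase's words are a subset of the other's. Both of the following score 100 by construction:

>>> fuzz.token_sort_ratio("used for testing", "testing for used")
100
>>> fuzz.token_set_ratio("this is a sentence used for testing", "a sentence used for testing")
100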

Avish
  • Does it consider words, or does it still treat a phrase as a single string? – user1424739 Apr 11 '19 at 18:31
  • Can you clarify your question? – Avish Apr 11 '19 at 18:35
  • Does it first compare words between phrases and compute the phrase difference using word scores? – user1424739 Apr 11 '19 at 18:54
  • I don't think it does, but you'll have better luck checking the project docs. I got a similarity of 83 with `"thisisasentenceusedfortesting"`, which probably indicates it doesn't care about words. However, some of its other methods like `token_sort` and `token_set` do care about words. – Avish Apr 11 '19 at 18:56
  • OK. Also, the difference between different word forms of the same word (e.g., plural vs singular) should be smaller than the difference between two different words (e.g., “took” vs “look”). Is there a similarity score that can take care of this? – user1424739 Apr 11 '19 at 19:01
  • There are more complex approaches that probably involve stemming or even semantic assignment (e.g. using wordnet, word2vec etc.), but I'm not familiar with specific ones. NLTK would be a good place to start looking. – Avish Apr 11 '19 at 19:02
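As a rough sketch of the directions mentioned in the last comment: nltk's PorterStemmer collapses inflected forms of the same word to one stem, and its WordNet interface can score relatedness between distinct words. The helper below is illustrative, not a library function.

from nltk.stem import PorterStemmer
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet') once

stemmer = PorterStemmer()
# Inflected forms of the same word collapse to the same stem:
stemmer.stem("words") == stemmer.stem("word")   # True

def word_similarity(w1, w2):
    # Compare the most common synset of each word using WordNet's
    # path similarity; returns None if either word is unknown.
    s1, s2 = wn.synsets(w1), wn.synsets(w2)
    if not s1 or not s2:
        return None
    return s1[0].path_similarity(s2[0])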