
I am looking for a way to output the match percentage between two strings (e.g. names), while also taking into consideration that they might be the same but with the words in a different order. I tried using SequenceMatcher() but the results are only partially satisfying:

a = "john doe"
b = "jon doe"
c = "doe john"
d = "jon d"
e = 'john do'

s = SequenceMatcher(None, a, b)
s.ratio()
0.9333333333333333

s = SequenceMatcher(None, a, c)
s.ratio()
0.5

s = SequenceMatcher(None, a, d)
s.ratio()
0.7692307692307693

s = SequenceMatcher(None, a, e)
s.ratio()
0.9333333333333333

I am OK with all but the second result. I notice that it does not take into consideration that c contains the same words as a, just in a different order.

Is there any other way to match strings and obtain a higher matching percentage in the case I mentioned above? It should also be taken into consideration that names may contain more than two words.

Thank you!


2 Answers


That depends on what you expect from the enhanced matching. If you think the second pair should score 1.0, then it's simple: split each string into words, sort the words, then apply SM (SequenceMatcher). If you want a match penalty for the reordering, you could use any of the sequence transformation metrics to measure the distance between the two word lists, and use that as a factor on the eventual match.

Does that help move you along?
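A minimal sketch of both ideas, assuming names are plain whitespace-separated words (word_sorted_ratio, penalized_ratio, and the weight knob are illustrative names, not a fixed recipe):

from difflib import SequenceMatcher

def word_sorted_ratio(x, y):
    # sort the words inside each name so word order no longer matters
    sx = " ".join(sorted(x.split()))
    sy = " ".join(sorted(y.split()))
    return SequenceMatcher(None, sx, sy).ratio()

def penalized_ratio(x, y, weight=0.5):
    # SequenceMatcher also accepts lists, so comparing the raw word lists
    # measures how much reordering took place; use it to discount the score
    order = SequenceMatcher(None, x.split(), y.split()).ratio()
    return word_sorted_ratio(x, y) * (1 - weight * (1 - order))

print(word_sorted_ratio("john doe", "doe john"))  # 1.0 -- order ignored
print(word_sorted_ratio("john doe", "jon doe"))   # ~0.93, same as the plain comparison
print(penalized_ratio("john doe", "doe john"))    # 0.75 -- full match, discounted for the swap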

  • Hi! Thank you for your answer. I do not know how to compute the distance between the lists of words. Could you elaborate on that? – calin.bule Nov 02 '18 at 12:52
  • If you don't know how to compute the distance, then you have some research to do -- see the posting guidelines and Vishnudev's references. – Prune Nov 02 '18 at 15:47
  • Hello again. It actually did help. I removed the special characters from the strings, then split them into words. The comparison though is made with SequenceMatcher because I work on a company computer and I can install Anaconda but not any of the packages that are not included with it. The string with the lowest number of words is compared with the permutations of the other that have the same number of words. The final result is the highest match. It works well for names of persons but given a high number of rows it takes a lot of time to execute. I am working on parallelizing the execution. – calin.bule Nov 08 '18 at 09:45
  • Okay ... so you have an application more complex than single checks of string pairs? If you're doing this repeatedly, you need to sort all of the names only *once*. Then use "groupby" on the name length to identify classes useful to check against one another. If you need to find *all* combinations of highest match, this moves into the realm of graph distances. – Prune Nov 08 '18 at 17:16
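A rough sketch of the sort-once / groupby idea from the last comment (the names list is made up for illustration):

from itertools import groupby

names = ["john doe", "doe john", "jon doe", "mary ann smith", "ann mary smith"]

# normalize once: sort the words inside each name
normalized = [" ".join(sorted(n.split())) for n in names]

# groupby needs its input sorted by the grouping key (here: word count)
normalized.sort(key=lambda n: len(n.split()))
groups = {k: list(g) for k, g in groupby(normalized, key=lambda n: len(n.split()))}
print(groups)
# {2: ['doe john', 'doe john', 'doe jon'], 3: ['ann mary smith', 'ann mary smith']}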

You could go with other string similarity algorithms. The choice of similarity algorithm depends largely on the use case, so choose carefully!

The textdistance library implements many text-distance algorithms. The best fit for your case would be Sorensen–Dice similarity or Jaccard similarity.

Code:

import textdistance as td

a = "john doe"
c = "doe john"
# Sorensen-Dice similarity, normalized to [0, 1]; the reordered name scores a full match
print(td.sorensen.normalized_similarity(a, c))

Output:

1.0
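
Jaccard, mentioned above, behaves the same way on this pair, since both names contain exactly the same characters:

print(td.jaccard.normalized_similarity(a, c))
# 1.0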
  • Hi! First of all thank you for your answer. I'll take a look at the algorithms that you proposed to see which one would be best for my case. I do have one question though: how does Sorensen work if the two strings differ in length? For example one might be "John Doe" and the other "John Jack Doe". For simplicity we assume that it is the same person and ignore potential false positives. Thanks again. – calin.bule Nov 02 '18 at 12:47
  • @calin.bule It is based on intersection of sets. Please read the content in the link for more details. It will surely have a lower score for strings of different lengths. – Vishnudev Krishnadas Nov 09 '18 at 13:24
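
To make the length question above concrete, a quick check with the names from the comment (the scores are indicative: textdistance counts characters by default, and also accepts pre-split word lists):

import textdistance as td

# identical words, identical characters: full score
print(td.sorensen.normalized_similarity("john doe", "doe john"))       # 1.0
# an extra middle name shrinks the shared portion relative to the total, so the score drops below 1.0
print(td.sorensen.normalized_similarity("john doe", "john jack doe"))
# word-level comparison instead of character-level
print(td.sorensen.normalized_similarity("john doe".split(), "john jack doe".split()))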