
I want to fuzzy-match names, but as the example below shows, I am not getting the match I want. Is there any way around this? I want 'Mr Mark Longfield' to be preferred over 'Mr Laurence Boode', as it is more likely to be the correct match.

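from fuzzywuzzy import fuzz, process

# Note: `str` shadows the built-in type; a name such as `query` would be safer.
str = 'Mr Lonfield'
# A list literal is needed here: list() accepts at most one argument.
L = ['Mr Laurence Boode', 'Mr Mark Longfield']
print(process.extractOne(str, L))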

Output: ('Mr Laurence Boode', 86)

Is this mostly down to the structure of the list and the strings? If I removed people's first names I would of course be more likely to get the right match, but I would rather keep their full names.

Keyur Potdar

1 Answer


For what it's worth, the following will produce your expected match:

print(process.extractOne(str, L, scorer=fuzz.token_set_ratio))

In this case, you will get:

('Mr Mark Longfield', 79)

Laurence Boode's score is 43 in this scenario.

I say "for what it's worth" because I could not find much detail on how this scorer works, other than by reading the source code (link below).

Also, you would of course need to test how well this works on your larger population.
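One way to run that kind of test is a small harness that counts how often a scorer picks the expected name. The following is only a sketch under assumptions: `simple_ratio`, `best_match` and `accuracy` are hypothetical helpers, the `difflib` ratio stands in for a real scorer, and the second labelled pair is invented for illustration.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Stand-in scorer: case-insensitive difflib similarity on a 0-100 scale."""
    return int(round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()))


def best_match(query, choices, scorer):
    """Return the choice that the scorer rates highest for this query."""
    return max(choices, key=lambda choice: scorer(query, choice))


def accuracy(labelled, choices, scorer):
    """Fraction of (query, expected) pairs for which the top choice is the expected one."""
    hits = sum(best_match(query, choices, scorer) == expected
               for query, expected in labelled)
    return hits / len(labelled)


choices = ['Mr Laurence Boode', 'Mr Mark Longfield']
# Hypothetical hand-labelled sample: (misspelt query, name it should match).
labelled = [
    ('Mr Lonfield', 'Mr Mark Longfield'),
    ('Mr Boode', 'Mr Laurence Boode'),
]
print(accuracy(labelled, choices, simple_ratio))
```

Any callable that takes two strings and returns a number can be passed as `scorer`, so `fuzz.token_set_ratio` could be dropped in to compare scorers on the same labelled sample.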

There are other scorer options you can test with. One of those may be an even better fit. See here for details.

I used token_set_ratio in the Java port of this library a while ago, for matching movie titles. If I recall correctly, it worked well enough for my needs, but there were definitely cases where I got false positives. That was due to the nature of certain movie titles, and probably does not apply to your scenario.

I hope it helps.

Update

Some notes from comments in the source:

A token_set is the set of alphanumeric tokens in a string (splitting on whitespace).
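As a rough illustration of that definition, here is a minimal sketch. The helper name `token_set` is hypothetical, and the lowercasing and punctuation handling are an assumption that only approximates whatever preprocessing the library applies.

```python
import re


def token_set(text):
    """Lowercase, turn runs of non-alphanumerics into spaces, split on whitespace."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return set(cleaned.split())


print(sorted(token_set('Mr Mark Longfield')))  # ['longfield', 'mark', 'mr']
```

Because the result is a set, repeated tokens and token order no longer matter, which is why a missing or extra first name hurts less than it does in a plain character-by-character comparison.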

Functions:

token_set_ratio: Returns a measure of the sequences' similarity between 0 and 100.

token_sort_ratio: Returns a measure of the sequences' similarity between 0 and 100, but sorting the tokens before comparing.

partial_ratio: Returns the ratio of the most similar substring as a number between 0 and 100.

partial_token_set_ratio: Returns the ratio of the most similar substring as a number between 0 and 100.

partial_token_sort_ratio: Returns the ratio of the most similar substring as a number between 0 and 100, but sorting the tokens before comparing.
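The descriptions above are terse, so here is a simplified sketch of the token-sort and token-set ideas using only the standard library. It is not the library's code: the `*_sketch` names are hypothetical, `difflib.SequenceMatcher` stands in for the underlying string ratio, and the preprocessing is an assumption. Scores from the real library may differ depending on the installed backend.

```python
import re
from difflib import SequenceMatcher


def tokens(text):
    """Lowercase, turn non-alphanumerics into spaces, split on whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def ratio(a, b):
    """Similarity of two strings on a 0-100 scale (difflib is a stand-in here)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def token_sort_ratio_sketch(s1, s2):
    """Compare the strings after sorting their tokens alphabetically."""
    return ratio(" ".join(sorted(tokens(s1))), " ".join(sorted(tokens(s2))))


def token_set_ratio_sketch(s1, s2):
    """Compare shared tokens against shared-plus-remaining tokens; keep the best score."""
    t1, t2 = set(tokens(s1)), set(tokens(s2))
    if not t1 or not t2:
        return 0
    common = " ".join(sorted(t1 & t2))
    combined1 = (common + " " + " ".join(sorted(t1 - t2))).strip()
    combined2 = (common + " " + " ".join(sorted(t2 - t1))).strip()
    return max(ratio(common, combined1),
               ratio(common, combined2),
               ratio(combined1, combined2))


query = 'Mr Lonfield'
for name in ['Mr Laurence Boode', 'Mr Mark Longfield']:
    print(name, token_set_ratio_sketch(query, name))
# Mr Laurence Boode 43
# Mr Mark Longfield 79
```

In this example the shared token "mr" is set aside and the comparison that wins is "mr lonfield" against "mr longfield mark", which is why the Longfield entry comes out well ahead.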

There are some additional usage examples in the Java port documentation.

andrewJames