I'm using fuzzywuzzy to compare two large sets of strings against one another.
The strings are company names from two different data sources, and many matches that look obvious to a human are not being found.
I get a lot of great matches for scores over 90, but many valid matches are missed at lower thresholds. Setting the threshold too low, however, brings in a lot of junk. For example:
FuzzySearch.extractTop(targetName, sourceName, 3, 10)
Source list:
"Coca-Cola", "Coca Cola - Other", "Coca-Cola Amatil", "Coca-Cola Company", "Coca-Cola Icecek", "Coca-Cola Services", "CocaCola Amatil", "Cola-Cola", "Colanta"
Target list:
"the coca-cola co.", "other"
Produces the following top matches:
"the coca-cola co.":
Coca-Cola, score: 90
Coca Cola - Other, score: 86
Cola-Cola, score: 86
Coca-Cola Company, score: 73
Coca-Cola Icecek, score: 71
Coca-Cola Amatil, score: 68
Coca-Cola Services, score: 68
Colanta, score: 61
CocaCola Amatil, score: 58
"other":
Coca Cola - Other, score: 90
Coca-Cola Services, score: 36
Colanta, score: 33
CocaCola Amatil, score: 20
Coca-Cola Icecek, score: 19
Coca-Cola Amatil, score: 19
Cola-Cola, score: 18
Coca-Cola Company, score: 18
Coca-Cola, score: 18
In the first run, I would hope the words "coca" and "cola" would be given more weight, so that something like "Coca-Cola Services" would score higher against "the coca-cola co." than its current 68. Also, "Coca Cola - Other" scored higher against "other" (90) than it did against "the coca-cola co." (86).
Are there any tweaks I can make to the fuzzy matching algorithm, or ways to clean up my data before running it? Or is there another string-matching algorithm better suited to this type of data?
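To make the "clean up my data" idea concrete, here is a rough sketch of the kind of pre-processing I have in mind. It is not something I have validated: the class name NameCleaner, the normalize helper, and the noise-word list are all hypothetical and only for illustration. It lowercases each name, replaces punctuation with spaces, and drops generic tokens such as "the" and "co" before the names are passed to the matcher.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

public class NameCleaner {
    // Hypothetical noise words; this list is an assumption and not tuned on real data.
    private static final Set<String> NOISE =
            Set.of("the", "co", "company", "inc", "corp", "ltd");

    // Lowercase the name, replace runs of non-alphanumeric characters with a
    // single space, and drop any token found in the noise-word list.
    public static String normalize(String name) {
        String cleaned = name.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", " ")
                .trim();
        return Arrays.stream(cleaned.split(" "))
                .filter(token -> !token.isEmpty() && !NOISE.contains(token))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Both of these reduce to the same normalized form.
        System.out.println(normalize("the coca-cola co."));
        System.out.println(normalize("Coca-Cola Company"));
        // "other" is not in the noise list, so it survives as a token.
        System.out.println(normalize("Coca Cola - Other"));
    }
}
```

With this, "the coca-cola co." and "Coca-Cola Company" both become "coca cola", but I am not sure whether stripping tokens like this is a sound approach or whether it just moves the problem elsewhere (for example, "Coca Cola - Other" still keeps "other").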