
I'm running a fuzzywuzzy algorithm to compare two large sets of strings against one another.

The strings are company names from two different data sources, and this data seems unusual in that a lot of matches that look intuitive to a human are not being discovered.

I get a lot of great matches for scores over 90, but a lot of valid matches are missed at lower thresholds. Setting the threshold too low brings in a lot of junk, however. For example:

`FuzzySearch.extractTop(targetName, sourceName, 3, 10)`

Source

"Coca-Cola", "Coca Cola - Other", "Coca-Cola Amatil", "Coca-Cola Company", "Coca-Cola Icecek", "Coca-Cola Services", "CocaCola Amatil", "Cola-Cola", "Colanta"

Target List

"the coca-cola co.", "other"

Produces the following top matches:

"the coca-cola co.":
    Coca-Cola, score: 90
    Coca Cola - Other, score: 86
    Cola-Cola, score: 86
    Coca-Cola Company, score: 73
    Coca-Cola Icecek, score: 71
    Coca-Cola Amatil, score: 68
    Coca-Cola Services, score: 68
    Colanta, score: 61
    CocaCola Amatil, score: 58

"other":
    Coca Cola - Other, score: 90
    Coca-Cola Services, score: 36
    Colanta, score: 33
    CocaCola Amatil, score: 20
    Coca-Cola Icecek, score: 19
    Coca-Cola Amatil, score: 19
    Cola-Cola, score: 18
    Coca-Cola Company, score: 18
    Coca-Cola, score: 18

In the first run, I would hope the words "coca" and "cola" would be given more importance, so that something like "Coca-Cola Services" would score higher against "the coca-cola co." than the 68 it gets. Also, "Coca Cola - Other" was a stronger match to "other" (score 90) than it was to "the coca-cola co." (score 86).

Are there any tweaks I can make to the fuzzy algorithm, or ways to clean up my data before running it? Or maybe there is another string-matching algorithm better suited to this type of data?
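One direction worth trying is a token-based score in the spirit of fuzzywuzzy's `token_set_ratio`: split both strings into sorted token sets and score the shared tokens against each "shared + leftover" combination, so word order and extra words matter less than in a plain character ratio. Below is a minimal, hand-rolled sketch of that idea in plain Java — it uses an LCS-based ratio rather than the library's internals, so the exact numbers will not match fuzzywuzzy's, and `tokenSetRatio` is just an illustrative name:

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.TreeSet;

public class TokenSetRatio {

    // Length of the longest common subsequence, via dynamic programming.
    static int lcs(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    curr[j] = prev[j - 1] + 1;
                } else {
                    curr[j] = Math.max(prev[j], curr[j - 1]);
                }
            }
            int[] tmp = prev; prev = curr; curr = tmp;
            Arrays.fill(curr, 0);
        }
        return prev[b.length()];
    }

    // 0-100 similarity, close in spirit to fuzzywuzzy's simple ratio.
    static int ratio(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) return 100;
        return (int) Math.round(200.0 * lcs(a, b) / (a.length() + b.length()));
    }

    // Token-set comparison: score the shared tokens against each
    // "shared + leftover" combination and keep the best result.
    static int tokenSetRatio(String s1, String s2) {
        TreeSet<String> t1 = tokens(s1), t2 = tokens(s2);
        TreeSet<String> common = new TreeSet<>(t1); common.retainAll(t2);
        TreeSet<String> rest1 = new TreeSet<>(t1);  rest1.removeAll(t2);
        TreeSet<String> rest2 = new TreeSet<>(t2);  rest2.removeAll(t1);
        String base  = String.join(" ", common);
        String full1 = (base + " " + String.join(" ", rest1)).trim();
        String full2 = (base + " " + String.join(" ", rest2)).trim();
        return Math.max(ratio(base, full1),
               Math.max(ratio(base, full2), ratio(full1, full2)));
    }

    // Lower-case and split on non-word characters, dropping empty tokens.
    private static TreeSet<String> tokens(String s) {
        TreeSet<String> out = new TreeSet<>();
        for (String t : s.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenSetRatio("the coca-cola co.", "Coca-Cola Services"));
        System.out.println(tokenSetRatio("the coca-cola co.", "Coca-Cola Company"));
    }
}
```

The trade-off is visible in this very data set: a token-set score lifts "Coca-Cola Services" against "the coca-cola co.", but it also scores "other" against "Coca Cola - Other" at 100, because one token set is a subset of the other. So stripping generic tokens such as "other", "co", or "company" before matching still helps.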

IcedDante
  • It can be beneficial to remove special characters if the algorithm doesn’t ignore them. Sometimes it’s best to smush the words together, but I’m not familiar with fuzzywuzzy, so it might not help (or be actively detrimental). – Dave Newton Aug 12 '20 at 01:11
  • If you are not relying on case sensitivity, it might be a good idea to normalize case before calling the algorithm - for example, convert everything to upper or lower case. That might be beneficial, since the algorithm might give different results for the same characters in different cases. – Norbert Dopjera Aug 12 '20 at 02:12
  • Which usage are you currently using? Simple ratio, partial ratio...? Something more customized? There are a few different basic usages - see examples [here](https://github.com/xdrop/fuzzywuzzy). – andrewJames Aug 12 '20 at 12:12
  • There are other similarity & distance algorithms in the Apache [Commons-Text](https://commons.apache.org/proper/commons-text/userguide.html) project if you want to investigate other approaches. – andrewJames Aug 12 '20 at 12:13
  • @andrewjames I'm using `extractTop(...)` as you can see in the question. I also experimented with token search – IcedDante Aug 13 '20 at 23:41
  • @IcedDante - apologies - it was there all along (weighted ratio). The best I have managed in the past was to (a) remove punctuation, then (b) fold to ascii (remove diacritics), and then (c) transform to lower case. Then I used a (commercial) tool which implemented a first-pass Levenshtein, followed by a second-pass fuzzy matcher. Even then, there were false positives and negatives which needed to be handled with manual intervention. Fortunately, the volume was not large. – andrewJames Aug 14 '20 at 00:15
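The cleanup steps in the last comment (remove punctuation, fold to ASCII, lower-case) can be sketched with only the Java standard library, using `java.text.Normalizer` to strip diacritics; `clean` is just an illustrative name, not part of any library:

```java
import java.text.Normalizer;
import java.util.Locale;

public class NameCleaner {

    // (a) fold to ASCII by decomposing and stripping combining marks,
    // (b) replace punctuation with spaces, (c) collapse whitespace
    // and lower-case with a fixed locale.
    static String clean(String name) {
        String folded = Normalizer.normalize(name, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");            // drop combining marks
        return folded.replaceAll("[^A-Za-z0-9]+", " ") // punctuation -> space
                .trim()
                .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(clean("The Coca-Cola Co.")); // the coca cola co
        System.out.println(clean("Coca-Cola İçecek"));
    }
}
```

Running both name lists through a cleaner like this before calling `FuzzySearch.extractTop` removes the hyphen and case differences that otherwise depress scores for pairs like "Coca-Cola" and "CocaCola Amatil".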
