1

I have two lists of companies (> 2k entries in the longer list) in different formats that I need to unify. I know that both formats share a stub about 80% of the time, so I'm using fuzzy matching to compare the two lists:

def get_fuzz_score(str1, str2):

    from fuzzywuzzy import fuzz
    partial_ratio = fuzz.partial_ratio(str1, str2)
    return partial_ratio


a = ['Express Scripts', 'Catamaran Corp', 'Banmedica SA (96.7892%)', 'WebMD', 'ODC', 'Caremerge LLC (Stake%)']
b = ['Doctor on Demand', 'Catamaran', 'Express Scripts Holding Corp', 'ODC, Inc.', 'WebMD Health Services', 'Banmedica']

for i in b:
    for j in a:
        if get_fuzz_score(i, j) > 80:
            pass  # process

I'd appreciate thoughts on how to optimize this task for performance (e.g., not having to use two for loops).

lajulajay

3 Answers

3

First, I would move the line from fuzzywuzzy import fuzz out of the function and up to the top of the file.

Next, it appears that you want to check every element, so you are comparing all-to-all anyway, and I don't see a simple workaround for that.

If the data are 'nice', then you could apply a simple heuristic, e.g. only compare names that share a first letter (judging from the examples you've posted, but that depends on the data).
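
A minimal sketch of that first-letter heuristic, under some assumptions: match_by_first_letter and partial_similarity are names made up for this example, and partial_similarity is a rough stand-in built on the standard library's difflib so the snippet runs without extra packages. It is not fuzzywuzzy's partial_ratio and will not give identical scores; in real code you would pass fuzz.partial_ratio as the scorer.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def partial_similarity(s1, s2):
    """Stand-in scorer (0-100): best difflib ratio of the shorter string
    against every same-length window of the longer one.
    This is NOT fuzzywuzzy's partial_ratio, only a rough approximation."""
    shorter, longer = sorted((s1.lower(), s2.lower()), key=len)
    if not shorter:
        return 0.0
    n = len(shorter)
    best = max(
        SequenceMatcher(None, shorter, longer[i:i + n]).ratio()
        for i in range(len(longer) - n + 1)
    )
    return 100 * best


def match_by_first_letter(a, b, scorer=partial_similarity, threshold=80):
    """Hypothetical helper: only score pairs whose first letters agree."""
    # Bucket the names in b by their lowercased first character.
    buckets = defaultdict(list)
    for name in b:
        if name:
            buckets[name[0].lower()].append(name)

    matches = []
    for name in a:
        if not name:
            continue
        # Names in other buckets are never scored, which is where the saving comes from.
        for candidate in buckets.get(name[0].lower(), []):
            if scorer(name, candidate) > threshold:
                matches.append((name, candidate))
    return matches


a = ['Express Scripts', 'Catamaran Corp', 'Banmedica SA (96.7892%)', 'WebMD', 'ODC', 'Caremerge LLC (Stake%)']
b = ['Doctor on Demand', 'Catamaran', 'Express Scripts Holding Corp', 'ODC, Inc.', 'WebMD Health Services', 'Banmedica']

pairs = match_by_first_letter(a, b)
```

The trade-off is that any pair whose first letters differ (a leading "The", a typo in the first character) is silently missed, so this only makes sense if the shared stub really does start both names.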

Best regards

P.S. I would have posted this as a comment if my reputation were high enough.

Marek Schwarz
3

fuzzywuzzy provides a process.extract* family of functions to help with this, e.g.:

from fuzzywuzzy import process

a = ['Express Scripts', 'Catamaran Corp', 'Banmedica SA (96.7892%)', 'WebMD', 'ODC', 'Caremerge LLC (Stake%)']
b = ['Doctor on Demand', 'Catamaran', 'Express Scripts Holding Corp', 'ODC, Inc.', 'WebMD Health Services', 'Banmedica']

for name in a:
    print(name, process.extract(name, b, limit=3))

This will print out each name in a and its three top matches from b.

This is still O(n**2), but because the library is open source you can see how extract is defined and perhaps do the preprocessing once rather than on every call, which would hopefully speed things up a lot.
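
To illustrate the "preprocess once" idea, here is a rough sketch that cleans each list a single time and then compares only the cleaned strings. Everything in it is an assumption made for the example: normalize, partial_similarity and best_matches are hypothetical names, the cleanup rules are guesses based on the sample data, and the scorer is a difflib-based stand-in from the standard library rather than anything taken from fuzzywuzzy.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Hypothetical cleanup: lowercase, drop parenthesised notes and punctuation."""
    name = re.sub(r"\(.*?\)", " ", name.lower())
    name = re.sub(r"[^a-z0-9 ]", " ", name)
    return " ".join(name.split())


def partial_similarity(s1, s2):
    """Stand-in scorer (0-100), NOT fuzzywuzzy's partial_ratio: best difflib
    ratio of the shorter string against each same-length window of the longer."""
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0.0
    n = len(shorter)
    return 100 * max(
        SequenceMatcher(None, shorter, longer[i:i + n]).ratio()
        for i in range(len(longer) - n + 1)
    )


def best_matches(a, b, threshold=80):
    # Clean every name exactly once, outside the comparison loop.
    norm_a = [(name, normalize(name)) for name in a]
    norm_b = [(name, normalize(name)) for name in b]

    result = {}
    for raw_a, clean_a in norm_a:
        scored = [(partial_similarity(clean_a, clean_b), raw_b)
                  for raw_b, clean_b in norm_b]
        # Keep only the highest-scoring candidate for this name.
        score, raw_b = max(scored, default=(0.0, None))
        if score > threshold:
            result[raw_a] = raw_b
    return result


a = ['Express Scripts', 'Catamaran Corp', 'Banmedica SA (96.7892%)', 'WebMD', 'ODC', 'Caremerge LLC (Stake%)']
b = ['Doctor on Demand', 'Catamaran', 'Express Scripts Holding Corp', 'ODC, Inc.', 'WebMD Health Services', 'Banmedica']

matches = best_matches(a, b)
```

The number of comparisons is unchanged, but the string cleanup now runs len(a) + len(b) times instead of once per pair.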

Sam Mason
2

I assume you installed both fuzzywuzzy and python-Levenshtein. For me, the installation of the second package failed, so I got this warning:

warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')

You can use itertools.product to create the cartesian product:

from itertools import product
from fuzzywuzzy import fuzz

def get_fuzz_score(str1, str2):
    partial_ratio = fuzz.partial_ratio(str1, str2)
    return partial_ratio


a = ['Express Scripts', 'Catamaran Corp', 'Banmedica SA (96.7892%)', 'WebMD', 'ODC', 'Caremerge LLC (Stake%)']
b = ['Doctor on Demand', 'Catamaran', 'Express Scripts Holding Corp', 'ODC, Inc.', 'WebMD Health Services', 'Banmedica']

for first, second in product(a, b):
    if get_fuzz_score(first, second) > 80:
        pass  # process

If your function get_fuzz_score isn't going to grow beyond this single call, you can drop it and call fuzz.partial_ratio directly:

from itertools import product
from fuzzywuzzy import fuzz

a = ['Express Scripts', 'Catamaran Corp', 'Banmedica SA (96.7892%)', 'WebMD', 'ODC', 'Caremerge LLC (Stake%)']
b = ['Doctor on Demand', 'Catamaran', 'Express Scripts Holding Corp', 'ODC, Inc.', 'WebMD Health Services', 'Banmedica']

for first, second in product(a, b):
    if fuzz.partial_ratio(first, second) > 80:
        pass  # process
Frank
  • 1
    itertools.product might be exactly what I need -- will give it a try and report back – lajulajay Oct 21 '19 at 16:01
  • 1
    Re: the python-Levenshtein warning, I fought with that for a long time and ended up having to switch to a Conda environment to fix it – lajulajay Oct 21 '19 at 16:03