
I am working on a project that applies fuzzy matching to a list of names that could grow to about 100,000 unique records. In our most recent screening run, the function we use takes about 2.20 seconds per name on average, which means a list of 10,000 names would take roughly 6 hours. That is far too long.
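For context, the 6-hour figure is just the quoted per-name time scaled up:

# Back-of-the-envelope check of the runtime quoted above
seconds_per_name = 2.20
n_names = 10_000
print(seconds_per_name * n_names / 3600)  # roughly 6.1 hours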

Is there a way that we can speed up our process? Here's the snippet of the script that we use.

# Importing packages
import pandas as pd
import Levenshtein as lev

# Reading cleaned datasets
df_name_reference = pd.read_csv('path_to_file') 
df_name_to_screen = pd.read_csv('path_to_file')

# Function used in name screening
def get_similarity_score(s1, s2):
    ''' Return match percentage between 2 strings disregarding name swapping

    Parameters
    -----------
    s1 : str : name from df_name_reference (to be used within pandas apply)
    s2 : str : name from df_name_to_screen (ref_name variable)

    Return
    -----------
    float
    '''
    # Get sorted names
    s1_sort = ' '.join(sorted(s1.split(' '))).strip() if isinstance(s1, str) else ''
    s2_sort = ' '.join(sorted(s2.split(' '))).strip() if isinstance(s2, str) else ''

    # Get ratios and return the max value
    # THIS COULD BE THE BOTTLENECK OF OUR SCRIPT: MORE DETAILS BELOW
    return max([
        lev.ratio(s1, s2),
        lev.ratio(s1_sort, s2),
        lev.ratio(s1, s2_sort),
        lev.ratio(s1_sort, s2_sort)
    ])

# Screening loop: score every name to screen against the full reference list
screening_results = []

for row in range(df_name_to_screen.shape[0]):

    # Get name to screen
    ref_name = df_name_to_screen.loc[row, 'fullname']
    
    # Get scores
    scores = df_name_reference.fullname.apply(get_similarity_score, args=(ref_name,))

    # Append results
    screening_results.append(pd.DataFrame({'screened_name':ref_name, 'scores':scores}))

I take four scores from lev.ratio to handle variations in name order, i.e. firstname-lastname versus lastname-firstname formats. I know the fuzzywuzzy package has token_sort_ratio, but it only splits the name into parts and sorts them alphabetically before comparing, which leads to lower scores in some cases. Plus, fuzzywuzzy is slower than Levenshtein. So I capture the similarity scores of both the sorted and unsorted names manually and take the maximum.
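To make the name-order point concrete, here is a minimal sketch (the names are made up for illustration): the raw ratio drops when the first and last names are swapped, while the sorted forms match exactly, so taking the maximum covers both arrangements.

import Levenshtein as lev

# Hypothetical example: same person, two name orders
a = "maria santos"
b = "santos maria"

# Raw comparison penalizes the swapped order (ratio < 1.0)
print(lev.ratio(a, b))

# Sorting the name parts first makes both forms identical, so this is 1.0
a_sort = ' '.join(sorted(a.split(' ')))
b_sort = ' '.join(sorted(b.split(' ')))
print(lev.ratio(a_sort, b_sort))

# Taking the max of the sorted/unsorted combinations, as get_similarity_score
# does above, keeps the best score for either arrangement
print(max(lev.ratio(a, b), lev.ratio(a_sort, b_sort)))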

Can anyone give an approach that I could try? Thanks!

EDIT: Here's a sample dataset that you may try. This is in Google Drive.

jsv
  • Could you confirm that you want to apply the similarity score to the Cartesian product of rows from the reference and to_screen tables? – jlandercy Sep 06 '22 at 06:48
  • Also remember that the cost of computing the Levenshtein distance is roughly proportional to the product of the two string lengths (see Wikipedia; a quick timing sketch follows this comment thread). Finding a way to shorten the strings while keeping the cross product small might be a path worth investigating. – jlandercy Sep 06 '22 at 07:00
  • Yes, this is correct. We have a separate list of names `to_screen` which will be compared with the names in `reference`. There has been a layer of preprocessing steps (we also used metaphone, but that is a different process on top of this), so we're left with this set of names. – jsv Sep 06 '22 at 13:12
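As a side note on jlandercy's point about string length, here is an illustrative timing sketch (made-up strings, not from the original post) showing how the cost of lev.ratio grows with the lengths of the inputs:

import timeit
import Levenshtein as lev

# Made-up strings: roughly 12 characters vs roughly 120 characters
short_a, short_b = "maria santos", "santos maria"
long_a, long_b = short_a * 10, short_b * 10

t_short = timeit.timeit(lambda: lev.ratio(short_a, short_b), number=100_000)
t_long = timeit.timeit(lambda: lev.ratio(long_a, long_b), number=100_000)

# If the cost is roughly proportional to len(a) * len(b), the long-string case
# should be on the order of 100x slower than the short one
print(f"short: {t_short:.3f}s  long: {t_long:.3f}s")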

1 Answer


If you don't need scores for all entries in the reference data, but only the top N, you can use difflib.get_close_matches to filter out the rest before computing any Levenshtein scores:

import difflib

N_RESULTS = 10  # placeholder: number of closest candidates to keep per screened name

screening_results = []
for row in range(df_name_to_screen.shape[0]):
    ref_name = df_name_to_screen.loc[row, 'fullname']
    # Keep only the N_RESULTS closest candidates; cutoff=0 means difflib's own
    # similarity threshold does not discard anything
    skimmed = pd.DataFrame({
        'fullname': difflib.get_close_matches(
            ref_name,
            df_name_reference.fullname,
            N_RESULTS,
            0
        )
    })
    # lev.ratio shown here; the question's get_similarity_score can be swapped in
    scores = skimmed.fullname.apply(lev.ratio, args=(ref_name,))
    screening_results.append(pd.DataFrame({'screened_name': ref_name, 'scores': scores}))

This takes about 50ms per row using the file you provided.

Michal Racko
  • Unfortunately, we need the scores for all names. And I prefer to use `Levenshtein`, since we have already run tests with both `fuzzywuzzy` (which is also based on difflib) and `Levenshtein`. – jsv Sep 08 '22 at 01:23