
I am working on a project that applies fuzzy matching to a list of names that could grow to about 100,000 unique records. In our most recent screening run, the function we use takes about 2.20 seconds per name on average, which means a list of 10,000 names would take roughly 6 hours. That is far too long.
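For context, the 6-hour figure is just the quoted per-name time scaled up:

# Back-of-the-envelope check of the runtime quoted above
seconds_per_name = 2.20
n_names = 10_000
print(seconds_per_name * n_names / 3600)  # roughly 6.1 hours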

Is there a way that we can speed up our process? Here's the snippet of the script that we use.

# Importing packages
import pandas as pd
import Levenshtein as lev

# Reading cleaned datasets
df_name_reference = pd.read_csv('path_to_file') 
df_name_to_screen = pd.read_csv('path_to_file')

# Function used in name screening
def get_similarity_score(s1, s2):
    ''' Return match percentage between 2 strings disregarding name swapping

    Parameters
    -----------
    s1 : str : name from df_name_reference (to be used within pandas apply)
    s2 : str : name from df_name_to_screen (ref_name variable)

    Return
    -----------
    float
    '''
    # Get sorted names
    s1_sort = ' '.join(sorted(s1.split(' '))).strip() if isinstance(s1, str) else ''
    s2_sort = ' '.join(sorted(s2.split(' '))).strip() if isinstance(s2, str) else ''

    # Get ratios and return the max value
    # THIS COULD BE THE BOTTLENECK OF OUR SCRIPT: MORE DETAILS BELOW
    return max([
        lev.ratio(s1, s2),
        lev.ratio(s1_sort, s2),
        lev.ratio(s1, s2_sort),
        lev.ratio(s1_sort, s2_sort)
    ])

# Screening loop: score every name to screen against the full reference list
screening_results = []

for row in range(df_name_to_screen.shape[0]):

    # Get name to screen
    ref_name = df_name_to_screen.loc[row, 'fullname']
    
    # Get scores
    scores = df_name_reference.fullname.apply(get_similarity_score, args=(ref_name,))

    # Append results
    screening_results.append(pd.DataFrame({'screened_name':ref_name, 'scores':scores}))

I take four scores from lev.ratio to handle variations in name order, i.e. firstname-lastname versus lastname-firstname formats. I know the fuzzywuzzy package has token_sort_ratio, but it only splits the name into parts and sorts them alphabetically before comparing, which leads to lower scores in some cases. Plus, fuzzywuzzy is slower than Levenshtein. So I capture the similarity scores of both the sorted and unsorted names manually and take the maximum.
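To make the name-order point concrete, here is a minimal sketch (the names are made up for illustration): the raw ratio drops when the first and last names are swapped, while the sorted forms match exactly, so taking the maximum covers both arrangements.

import Levenshtein as lev

# Hypothetical example: same person, two name orders
a = "maria santos"
b = "santos maria"

# Raw comparison penalizes the swapped order (ratio < 1.0)
print(lev.ratio(a, b))

# Sorting the name parts first makes both forms identical, so this is 1.0
a_sort = ' '.join(sorted(a.split(' ')))
b_sort = ' '.join(sorted(b.split(' ')))
print(lev.ratio(a_sort, b_sort))

# Taking the max of the sorted/unsorted combinations, as get_similarity_score
# does above, keeps the best score for either arrangement
print(max(lev.ratio(a, b), lev.ratio(a_sort, b_sort)))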

Can anyone give an approach that I could try? Thanks!

EDIT: Here's a sample dataset that you may try. This is in Google Drive.

jsv
  • Could you confirm that you want to apply the similarity score to the Cartesian product of rows from the reference and to_screen tables? – jlandercy Sep 06 '22 at 06:48
  • Also remember that the cost of computing the Levenshtein distance is roughly proportional to the product of the two string lengths (see Wikipedia; a quick timing sketch follows this comment thread). Finding a way to shorten the strings while keeping the cross product small might be a path worth investigating. – jlandercy Sep 06 '22 at 07:00
  • Yes, this is correct. We have a separate list of names `to_screen` which will be compared with the names in `reference`. There has been a layer of preprocessing steps (we also used metaphone, but that is a different process on top of this), so we're left with this set of names. – jsv Sep 06 '22 at 13:12
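As a side note on jlandercy's point about string length, here is an illustrative timing sketch (made-up strings, not from the original post) showing how the cost of lev.ratio grows with the lengths of the inputs:

import timeit
import Levenshtein as lev

# Made-up strings: roughly 12 characters vs roughly 120 characters
short_a, short_b = "maria santos", "santos maria"
long_a, long_b = short_a * 10, short_b * 10

t_short = timeit.timeit(lambda: lev.ratio(short_a, short_b), number=100_000)
t_long = timeit.timeit(lambda: lev.ratio(long_a, long_b), number=100_000)

# If the cost is roughly proportional to len(a) * len(b), the long-string case
# should be on the order of 100x slower than the short one
print(f"short: {t_short:.3f}s  long: {t_long:.3f}s")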

1 Answer


If you don't need scores for all entries in the reference data, but only the top N, you can use difflib.get_close_matches to filter out the rest before computing any Levenshtein scores:

import difflib

N_RESULTS = 10  # placeholder: number of closest candidates to keep per screened name

screening_results = []
for row in range(df_name_to_screen.shape[0]):
    ref_name = df_name_to_screen.loc[row, 'fullname']
    # Keep only the N_RESULTS closest candidates; cutoff=0 means difflib's own
    # similarity threshold does not discard anything
    skimmed = pd.DataFrame({
        'fullname': difflib.get_close_matches(
            ref_name,
            df_name_reference.fullname,
            N_RESULTS,
            0
        )
    })
    # lev.ratio shown here; the question's get_similarity_score can be swapped in
    scores = skimmed.fullname.apply(lev.ratio, args=(ref_name,))
    screening_results.append(pd.DataFrame({'screened_name': ref_name, 'scores': scores}))

This takes about 50ms per row using the file you provided.

Michal Racko
  • Unfortunately, we need the scores for all names. And I prefer to use `Levenshtein`, since we have already run tests with both `fuzzywuzzy` (which is also based on difflib) and `Levenshtein`. – jsv Sep 08 '22 at 01:23