
I am trying to approximately match 600,000 individuals' names (full names) against another database that contains over 87 million observations (full names)!

My first attempt with the fuzzywuzzy library was far too slow, so I decided to use the fuzzyset module, which is much faster. Assuming I have a computer powerful enough to load the whole dataset in memory, I am doing the following with a test file of 964 observations to be matched against 50,000 observations:

import time
import pandas as pd
from cfuzzyset import cFuzzySet as FuzzySet

df1 = pd.read_csv(file1, delimiter='|')  # test file with 964 observations
df2 = pd.read_csv(file2, delimiter='|')  # test file with 50,000 observations to be matched against

a = FuzzySet()  # allocate the FuzzySet object
for row in df2['name']:
    a.add(str(row))  # fill the FuzzySet object with all names from df2

start_time = time.time()  # start recording the time

dicto = {'index': [], 'name': []}  # dictionary where I store the output

for names in df1['f_ofulln']:
    # a.get() returns a list of (score, matched string) tuples, best match first
    dicto['index'].append(a.get(names)[0][0])
    dicto['name'].append(a.get(names)[0][1])

print("--- %s seconds ---" % (time.time() - start_time))

>>> --- 39.68284249305725 seconds ---

So even on this much smaller dataset (964 observations matched against 50,000), the run already takes about 39 seconds.

However, this is far too slow for the full dataset: scaling linearly from 964 × 50,000 comparisons in roughly 40 seconds, the full 600,000 × 87 million problem involves about a million times as many comparisons, which extrapolates to months or years of single-core runtime.

Does anyone have an idea of how to improve the run time? I don't think Cython is an option, since I am already importing the Cython version of the fuzzyset module.

Many thanks,

Adrien

  • This problem is a lot harder than it seems. You might have to block (or cluster) the 87M records first before you perform the matching; I wouldn't suggest computing the distance from each of the 600k names to every name in the database. The Python library `dedupe` implements some blocking techniques, though I'm not sure it will scale to a dataset of your size. Another possibility is to drop duplicate names in both sets before you perform the fuzzy matching. (Sorry for my vague answer...) – titipata Jul 19 '16 at 23:25
  • Thanks titipata, I will definitely do as much as possible to reduce the size of the datasets to be merged. I also modified the question, since I discovered that the fuzzyset module is much faster. Lastly, I found resources to store the entire dataframe in memory. But even with all that, the final run time is still measured in weeks! – Adrien Jul 20 '16 at 06:22

1 Answer


So I am going to answer my own question since I found a way that is pretty fast.

I saved both databases in HDF5 format using pandas.HDFStore and pandas.to_hdf, with one dataframe per first letter of the last name. Then I wrote a function that finds the closest match, based on the python-Levenshtein module (which is very fast, since it is written in C).

Finally, I submitted 26 batch jobs at once, one per letter of the last name, which means I only match people whose last names share the same initial.

Note that I also programmed the function so that it only considers matches whose birth year differs by no more than one year.
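
To make the blocking step concrete, here is a minimal sketch of how the per-letter HDF5 files could be built with pandas.to_hdf; the 'lastname' column name and the file paths are placeholders for illustration, not the code I actually ran:

import pandas as pd

# Block both datasets on the first letter of the last name and save each block
# to HDF5, so every batch job only loads the pair of blocks it needs.
df1 = pd.read_csv('file1.csv', delimiter='|')   # placeholder path, the 600k names
df2 = pd.read_csv('file2.csv', delimiter='|')   # placeholder path, the 87M names

for letter in 'abcdefghijklmnopqrstuvwxyz':
    # 'lastname' is an assumed column name; na=False skips missing values
    block1 = df1[df1['lastname'].str.lower().str.startswith(letter, na=False)]
    block2 = df2[df2['lastname'].str.lower().str.startswith(letter, na=False)]
    block1.to_hdf('df1_blocks.h5', key=letter, mode='a')
    block2.to_hdf('df2_blocks.h5', key=letter, mode='a')

Each of the 26 batch jobs then only reads the two blocks for its letter with pd.read_hdf before matching.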

EDIT: Since it was requested, I am providing below a summary of my function. Unfortunately, the main function that merges the two dataframes is too long to post here.

# Needed imports:
from Levenshtein import ratio, jaro_winkler
import pandas as pd

# Function that gets the closest match of a word in a big list:

def get_closest_match(x, list_strings, fun):
    # fun: the matching method, e.g. ratio or jaro_winkler (cf. the python-Levenshtein module)
    best_match = None
    highest_ratio = 0
    for current_string in list_strings.values.tolist():
        current_score = fun(x, current_string)
        if current_score > highest_ratio:
            highest_ratio = current_score
            best_match = current_string
        if highest_ratio == 1:  # perfect match found, no need to keep scanning
            break
    return (best_match, highest_ratio)

# The function that matches the 2 dataframes (only the idea behind it, since it is too long to write everything out):
dicto = {'Index': [], 'score': [], ...}
def LevRatioMerge(df1, df2, fun, colname, required=[], approx=[], eps=[]):
    # Basically loop over df1 with:
    for name in df1.itertuples():
        result = get_closest_match(name[YourColnumber], df2[YourColname], fun)
        dicto['score'].append(result[1])
        dicto['Index'].append(name[0])
        ...

This is the idea. I hope it gives you enough to adapt it to your own problem.
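
As a usage sketch (not my full merge function), here is how get_closest_match above could be applied to one last-name block; the 'fullname' and 'birthyear' column names, file names, and HDF5 keys are placeholders for illustration:

from Levenshtein import ratio
import pandas as pd

# Load the two blocks for one last-name initial (placeholder file names and key).
block1 = pd.read_hdf('df1_blocks.h5', key='a')
block2 = pd.read_hdf('df2_blocks.h5', key='a')

results = {'Index': [], 'name': [], 'score': []}
for row in block1.itertuples():
    # Keep only candidates whose birth year is within 1 year ('birthyear' is a placeholder column).
    candidates = block2[(block2['birthyear'] - row.birthyear).abs() <= 1]['fullname']
    best, score = get_closest_match(row.fullname, candidates, ratio)
    results['Index'].append(row.Index)
    results['name'].append(best)
    results['score'].append(score)

matches = pd.DataFrame(results)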

  • So this was ages ago, but I am somewhat building on the same logic. May I know how you wrote your python-Levenshtein function? I am using process from FuzzyWuzzy and it seems too slow for my needs. – BernardL Aug 21 '16 at 06:24
  • I edited my answer, unfortunately my function is very long and really tailored for my problem. I hope that this small summary is enough for you to use the python-Levenshtein module for your own problem. – Adrien Aug 23 '16 at 15:19
  • Hi. I am trying to implement the same logic for address matching, keeping "city_name" as my blocking or indexing key. Is it possible to see the function you have written? That would be helpful for me. – ds_user Jun 20 '17 at 10:40