I am trying to approximately match 600,000 individuals' names (full names) to another database that has over 87 million observations (full names).
My first attempt with the fuzzywuzzy library was way too slow (a rough sketch of that attempt is at the end of this question), so I switched to the fuzzyset module, which is much faster. Assuming I have a computer powerful enough to load the whole dataset in memory, I am doing the following with a test file of 964 observations to be matched against 50,000 observations:
import time
import pandas as pd
from cfuzzyset import cFuzzySet as FuzzySet

df1 = pd.read_csv(file1, delimiter='|')  # test file with 964 names to match
df2 = pd.read_csv(file2, delimiter='|')  # test file with 50,000 candidate names

a = FuzzySet()  # allocate the FuzzySet object
for row in df2['name']:
    a.add(str(row))  # fill the FuzzySet object with all names from df2

start_time = time.time()  # start recording the time
dicto = {'score': [], 'name': []}  # dictionary where I store the output
for names in df1['f_ofulln']:
    result = a.get(names)[0]  # get() returns a list of (score, match) pairs, so call it only once
    dicto['score'].append(result[0])  # similarity score of the best match
    dicto['name'].append(result[1])   # the matched name itself

print("--- %s seconds ---" % (time.time() - start_time))
>>> --- 39.68284249305725 seconds ---
So even with this much smaller test (964 names matched against 50,000 candidates), the run took about 39 seconds. That is far too slow for the full dataset of 600,000 names against 87 million observations.
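A back-of-envelope extrapolation makes this concrete. It assumes the per-query cost stays constant, which is optimistic, since the index would grow from 50,000 to 87 million names:

per_query = 39.68 / 964      # ~41 ms per lookup against 50,000 names
total = 600_000 * per_query  # seconds for all 600,000 queries
print(total / 3600)          # ~6.9 hours for the lookups alone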
Does anyone have an idea of how to improve the run time? I don't think Cython is an option, since I am already importing the Cython version of the fuzzyset module (cfuzzyset).
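For reference, my earlier fuzzywuzzy attempt looked roughly like the sketch below. It is an illustrative reconstruction rather than my exact code, and it reuses df1 and df2 from above. process.extractOne scans every candidate for every query, which is why it was so slow:

from fuzzywuzzy import process

choices = [str(name) for name in df2['name']]  # all 50,000 candidate names
# extractOne returns a (matched_string, score) pair after comparing the
# query against every entry in choices, so each query is O(len(choices))
matches = [process.extractOne(name, choices) for name in df1['f_ofulln']]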
Many thanks,
Adrien