
I am working on a name-matching problem: given customer names need to be compared against 2.5 million existing customer records stored in a CSV file. The code below is what I tried, and it takes 5-12 minutes to match a single name. Since this will be exposed as an API to an RPA process, can anyone suggest another way to get the same result within one or two minutes?

import time

import pandas as pd
from fuzzywuzzy import fuzz

# names is the list passed to the program as parameter
names_with_sno = [[sno, name] for sno, name in enumerate(names, 1)]

# dataframe created for the given customer names
df1 = pd.DataFrame(names_with_sno, columns=['s_no','SDN_NAME_SERACH'])

# dataframe for customer database via csv
cust_2 = pd.read_csv(r'...\customer-database-extract\extract.CSV')

# .... preprocessing of both the dataframes
# .... which are not time consuming ones

### CROSS JOIN
#doing the cross join between the given names and customer database
#creating common key in the dataframe having the given names
df1["key"]=1

#creating common key in customer db dataset
cust_2["key"]=1

#dropping the common column key after creating the cross join
#(the positional axis argument in drop("key",1) is no longer accepted by pandas 2.x)
final_df = pd.merge(df1,cust_2,on="key").drop(columns="key")

def get_ratio(df):
    cust_name=df["FIRST_NAME"]
    hit_name=df["SDN_NAME_SERACH"]
    return fuzz.token_set_ratio(cust_name,hit_name)

st = time.mktime(time.localtime())

#applying the function for name matching and storing the result in a series
final_series = final_df.apply(get_ratio,axis=1)

print('\n\nt23 - df.apply(get_ratio) - ',secondsToText(time.mktime(time.localtime()) - st))

Here, df1 is the dataframe of the given names and cust_2 is the DB extract read from the CSV file. The print statement reports the time as:

t23 - df.apply(get_ratio) - 5.0 minutes, 42.0 seconds
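
Nearly all of that time goes into the cross join followed by `apply(axis=1)`: one row is materialised per (given name, customer) pair, and the scorer is then called row by row in Python. A minimal sketch of one way to reduce the work, assuming customer names repeat often enough for de-duplication to pay off, is to score each given name against the distinct customer names only. `best_matches` and the `difflib`-based `similarity` scorer are hypothetical stand-ins and not part of the code above; `fuzz.token_set_ratio` could be passed as `scorer` instead.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in scorer on a 0-100 scale (not the same metric as token_set_ratio)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_matches(query, customer_names, scorer=similarity, cutoff=80.0):
    """Score `query` against each distinct customer name once.

    Returns (name, score) pairs at or above `cutoff`, best score first.
    """
    unique_names = set(customer_names)  # duplicates are scored only once
    scored = ((name, scorer(query, name)) for name in unique_names)
    hits = [(name, score) for name, score in scored if score >= cutoff]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Illustrative data only; in the question this would be the FIRST_NAME column.
customers = ["John Smith", "Jon Smith", "John Smith", "Maria Garcia"]
matches = best_matches("John Smith", customers)
```

Whether this helps depends on how many duplicate names the extract contains; it does not change the cost of scoring one pair.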

  • Can anyone please help me with this? If my question is not clear, please suggest a better way of asking it, as this is my first question on an open forum. – user16351455 Oct 20 '21 at 06:34
  • As a first step you could try replacing FuzzyWuzzy with [RapidFuzz](https://github.com/maxbachmann/RapidFuzz), which should already reduce the runtime a lot (I would expect at least a 10x improvement) – maxbachmann Oct 20 '21 at 16:00
  • sorry @maxbachmann, I was on vacation for a week, so couldn't track this question... I will check the RapidFuzz and come up with the results... Thanks for your answer, it will help me a lot if it works. – user16351455 Oct 28 '21 at 09:19
  • thanks for your suggestion @maxbachmann... Sorry I was on vacation, haven't checked the question for a while... will check the RapidFuzz and get back with the findings... It will help me a lot if it works... – user16351455 Oct 28 '21 at 09:22
  • Apologies, I thought that I had already commented... Thanks @maxbachmann, it worked: using RapidFuzz has reduced the execution time by 3 to 4 times. Earlier it was 6 to 8 mins, now it is around 2.5 mins... Thanks for your suggestion. – user16351455 Nov 11 '21 at 08:41
