I am new to Python and I'm running fuzzywuzzy string-matching logic on a list with 2 million records. The code works and produces output, but it is extremely slow: in 3 hours it processes only 80 rows. I want to speed things up by making it process multiple rows at once.
If it helps, I'm running it on my machine with 16 GB RAM and a 1.9 GHz dual-core CPU.
Below is the code I'm running.
from fuzzywuzzy import process
import pandas as pd

d = []
n = len(Africa_Company)  # original list with 2m string records
for i in range(1, n):
    choices = Africa_Company[i + 1:n]
    word = Africa_Company[i]
    output = None  # so a failed match doesn't reuse the previous row's result
    try:
        # pass the list itself; wrapping it in str() would match against characters
        output = process.extractOne(str(word), choices, score_cutoff=85)
    except Exception:
        print(word)  # to identify which string is throwing an exception
    print(i)  # to know how many rows are processed; can do without this also
    if output:
        d.append({'Company': Africa_Company[i],
                  'NewCompany': output[0],
                  'Score': output[1],
                  'Region': 'Africa'})
    else:
        d.append({'Company': Africa_Company[i],
                  'NewCompany': None,
                  'Score': None,
                  'Region': 'Africa'})
Africa_Corrected = pd.DataFrame(d)  # output data in a pandas dataframe
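This is the direction I have in mind for "multiple rows at once", but I'm not sure it's the right approach. A rough, untested sketch (the names match_one and match_all are mine, and I've used the stdlib's difflib as a stand-in scorer so the snippet runs without fuzzywuzzy; the match_one body would call process.extractOne in the real version):

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher, get_close_matches

def match_one(i, companies, cutoff=0.85):
    """Match companies[i] against all later rows, as in the loop above."""
    word = companies[i]
    choices = companies[i + 1:]
    # difflib stands in for process.extractOne(word, choices, score_cutoff=85)
    hits = get_close_matches(word, choices, n=1, cutoff=cutoff)
    if hits:
        score = round(SequenceMatcher(None, word, hits[0]).ratio() * 100)
        return {'Company': word, 'NewCompany': hits[0],
                'Score': score, 'Region': 'Africa'}
    return {'Company': word, 'NewCompany': None,
            'Score': None, 'Region': 'Africa'}

def match_all(companies, workers=2):
    # ThreadPoolExecutor keeps the sketch simple; for CPU-bound pure-Python
    # scoring, a ProcessPoolExecutor (behind an `if __name__ == "__main__":`
    # guard) would be needed to actually use both cores, since threads
    # share the GIL.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: match_one(i, companies),
                             range(len(companies))))
```

Since pool.map preserves input order, the returned list of dicts could be fed to pd.DataFrame exactly as d is above.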
Thanks in advance!