I have a pandas DataFrame that contains two columns named `Potential Word` and `Fixed Word`. The `Potential Word` column contains words from different languages, both misspelled and correctly spelled, and the `Fixed Word` column contains the correct word corresponding to each `Potential Word`.
Below I have shared some sample data:
| Potential Word | Fixed Word |
|---|---|
| Exemple | Example |
| pipol | People |
| pimple | Pimple |
| Iunik | unique |
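For reproducibility, here is a minimal sketch that builds the sample vocab above as a DataFrame (the real `df` is loaded from my 600K-row source, so this tiny frame is only for illustration):

```python
import pandas as pd

# Tiny stand-in for the real 600K-row vocab, for illustration only.
df = pd.DataFrame({
    'Potential Word': ['Exemple', 'pipol', 'pimple', 'Iunik'],
    'Fixed Word': ['Example', 'People', 'Pimple', 'unique'],
})
```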
My vocab DataFrame contains 600K unique rows.
My Solution:
```python
key = given_word  # given_word is the misspelled word to fix
glob_match_value = 0
potential_fixed_word = ''
match_threshold = 0.65

for each in df['Potential Word']:
    # match is a function that returns a similarity value of two strings
    match_value = match(each, key)
    if match_value > glob_match_value and match_value > match_threshold:
        glob_match_value = match_value
        potential_fixed_word = each
```
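To make the snippet runnable end to end, here is the same loop wrapped as a function, with a `difflib`-based stand-in for `match` (my real `match` implementation may differ; this version is only to show the shape of the computation):

```python
import difflib

import pandas as pd

def match(a: str, b: str) -> float:
    # Stand-in similarity measure in [0.0, 1.0]; the real match() may differ.
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fix_word(given_word: str, df: pd.DataFrame, match_threshold: float = 0.65) -> str:
    """Return the closest 'Potential Word' above the threshold, or ''."""
    glob_match_value = 0.0
    potential_fixed_word = ''
    for each in df['Potential Word']:
        match_value = match(each, given_word)
        if match_value > glob_match_value and match_value > match_threshold:
            glob_match_value = match_value
            potential_fixed_word = each
    return potential_fixed_word

print(fix_word('pippol', df))  # -> 'pipol' with the sample df above
```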
Problem
The problem with my code is that it takes a lot of time to fix every word, because the loop has to run through the whole 600K-row vocab for each word. When a word is missing from the vocab, it takes almost 5 or 6 seconds to process a sentence of 10~12 words. The match function itself performs decently, so it is not the target of the optimization.
I need an optimized solution. Can anyone help me here?