I have a pandas DataFrame with one long string per row in a single column (see the variable 'dframe'). In a separate list I store keywords, which I have to compare against every word of every string in the DataFrame. If a keyword is found, I have to record it as a hit and mark which sentence it was found in. I am using a nested for-loop with a few 'if' statements, which gives me correct output but is not very efficient: it takes nearly 4 hours to run on my full data set of 130 keywords and thousands of rows.
I thought of applying a lambda function to speed this up, and that is what I am struggling with. Below is a small version of my data set and my current code.
import pandas as pd
from fuzzywuzzy import fuzz

dframe = pd.DataFrame({'Email': ['this is a first very long e-mail about fraud and money',
                                 'this is a second e-mail about money',
                                 'this would be a next message where people talk about secret information',
                                 'this is a sentence where someone misspelled word frad',
                                 'this sentence has no keyword']})
keywords = ['fraud', 'money', 'secret']
keyword_set = set(keywords)

dframe['Flag'] = False
dframe['part_word'] = 0
output = []

for k in range(0, len(keywords)):
    count_ = 0
    dframe['Flag'] = False
    for j in range(0, len(dframe['Email'])):
        row_list = []
        print(str(k) + ' / ' + str(len(keywords)) + ' || ' + str(j) + ' / ' + str(len(dframe['Email'])))
        for i in dframe['Email'][j].split():
            if dframe['part_word'][j] != 0:
                row_list = dframe['part_word'][j]
            fuz_part = fuzz.partial_ratio(keywords[k].lower(), i.lower())
            fuz_set = fuzz.token_set_ratio(keywords[k], i)
            if ((fuz_part > 90) | (fuz_set > 85)) & (len(i) > 3):
                if keywords[k] not in row_list:
                    row_list.append(keywords[k])
                    print(keywords[k] + ' found as : ' + i)
                    dframe['Flag'][j] = True
                    dframe['part_word'][j] = row_list
    count_ = dframe['Flag'].values.sum()
    if count_ > 0:
        y = keywords[k] + ' ' + str(count_)
        output.append(y)
    else:
        y = keywords[k] + ' ' + '0'
        output.append(y)
Maybe someone with experience in lambda functions could give me a hint on how to apply one to my DataFrame to perform a similar operation? It would need to somehow run the fuzzy matching inside the lambda after splitting each row's sentence into words, keeping a keyword only when its best match score exceeds the threshold (85 or 90). This is the part I am confused about. Thanks in advance for any help.
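To illustrate the shape I have in mind, here is a rough sketch. It uses difflib.SequenceMatcher from the standard library as a stand-in for fuzzywuzzy's scorers (the names find_keywords, ratio and the single 85 threshold are made up for the example, not my real code):

```python
from difflib import SequenceMatcher


def ratio(a, b):
    # Stand-in similarity score (0-100), case-insensitive.
    # In real use this would be fuzz.partial_ratio / fuzz.token_set_ratio.
    return int(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def find_keywords(sentence, keywords, threshold=85):
    # Keep only words longer than 3 characters, as in my loop.
    words = [w for w in sentence.split() if len(w) > 3]
    # A keyword counts as found if any word scores above the threshold.
    return [k for k in keywords
            if any(ratio(k, w) >= threshold for w in words)]


# Applied to the DataFrame it would be a single pass per row, e.g.:
# dframe['part_word'] = dframe['Email'].apply(
#     lambda s: find_keywords(s, keywords))
# dframe['Flag'] = dframe['part_word'].astype(bool)
```

I am not sure whether something along these lines is the right way to structure it, or how to get the per-keyword counts for 'output' from it efficiently.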