
I have a pandas DataFrame with one long string per row in a single column (see the variable 'dframe'). In a separate list I store keywords, each of which has to be compared with every word of every string in the DataFrame. If a keyword is found, I have to record it as a match and mark the sentence in which it was found. I am currently using a nested for-loop with a few 'if' statements, which gives me the correct output but is very inefficient: it takes nearly 4 hours to run on my full data set of 130 keywords and thousands of rows.

I thought of applying some lambda function for optimization, but this is something I am struggling with. Below is the idea of my data set and my current code.

import pandas as pd
from fuzzywuzzy import fuzz


dframe = pd.DataFrame({ 'Email' : ['this is a first very long e-mail about fraud and money',
                           'this is a second e-mail about money',
                           'this would be a next message where people talk about secret information',
                           'this is a sentence where someone misspelled word frad',
                           'this sentence has no keyword']})

keywords = ['fraud','money','secret']


keyword_set = set(keywords)

dframe['Flag'] = False
dframe['part_word'] = 0
output = []


for k in range(len(keywords)):
    count_ = 0
    dframe['Flag'] = False
    for j in range(len(dframe['Email'])):
        row_list = []
        print(str(k) + '  /  ' + str(len(keywords)) + '  ||  ' + str(j) + '  /  ' + str(len(dframe['Email'])))
        for i in dframe['Email'][j].split():
            if dframe['part_word'][j] != 0:
                row_list = dframe['part_word'][j]

            fuz_part = fuzz.partial_ratio(keywords[k].lower(), i.lower())
            fuz_set = fuzz.token_set_ratio(keywords[k], i)

            if ((fuz_part > 90) | (fuz_set > 85)) & (len(i) > 3):
                if keywords[k] not in row_list:
                    row_list.append(keywords[k])
                    print(keywords[k] + '  found as :  ' + i)
                # .at avoids the chained-assignment warnings of dframe['Flag'][j] = ...
                dframe.at[j, 'Flag'] = True
                dframe.at[j, 'part_word'] = row_list

    count_ = dframe['Flag'].values.sum()
    output.append(keywords[k] + ' ' + str(count_))

Maybe someone with experience in lambda functions could give me a hint on how to apply one to my DataFrame to perform a similar operation? It would require somehow applying fuzzy matching in a lambda after splitting each row's sentence into separate words, and choosing the value with the highest match score, on the condition that it is greater than 85 or 90. This is what I am confused about. Thanks in advance for any help.

Typek

1 Answer


I don't have a lambda function for you, but here is a function which you can apply to dframe.Email:

import pandas as pd
from fuzzywuzzy import fuzz

First, create the same example DataFrame as yours:

dframe = pd.DataFrame({ 'Email' : ['this is a first very long e-mail about fraud and money',
                       'this is a second e-mail about money',
                       'this would be a next message where people talk about secret information',
                       'this is a sentence where someone misspelled word frad',
                       'this sentence has no keyword']})

keywords = ['fraud','money','secret']

This is the function to apply:

def fct(sntnc, kwds):
    words = sntnc.split()  # split the sentence only once, not per keyword
    mtch = []
    for kwd in kwds:
        fuz_part = [fuzz.partial_ratio(kwd.lower(), w.lower()) > 90 for w in words]
        fuz_set = [fuzz.token_set_ratio(kwd, w) > 85 for w in words]
        bL = [len(w) > 3 for w in words]
        mtch.append(any((p | s) & l for p, s, l in zip(fuz_part, fuz_set, bL)))
    return mtch

For each keyword it computes fuz_part > 90 for every word in the sentence, the same for fuz_set > 85, and the same for word length > 3. Finally, for each keyword, it records in a list whether ((fuz_part > 90) | (fuz_set > 85)) & (wordlength > 3) holds for any word of the sentence.
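As a quick sanity check of why the misspelling 'frad' from the example data still matches 'fraud': the fuzz ratios are built on the same idea as the standard library's difflib similarity, which you can inspect without fuzzywuzzy (the exact fuzz scores differ slightly, so this is only a rough stand-in):

```python
from difflib import SequenceMatcher

# similarity between the keyword 'fraud' and the misspelled 'frad',
# scaled to 0-100 like the fuzz ratios
score = SequenceMatcher(None, 'fraud', 'frad').ratio() * 100
print(round(score))  # 89 -- right around the 85/90 thresholds used above
```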

And this is how it is applied and how the result is created:

s = dframe.Email.apply(fct, kwds=keywords)
s = s.apply(pd.Series).set_axis(keywords, axis=1)  # in pandas < 1.0 add inplace=False
dframe = pd.concat([dframe, s], axis=1)
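A side note on the expansion step (my own observation, not part of the answer above): apply(pd.Series) constructs one Series per row, which gets slow on large frames. Building the DataFrame from the list of lists in a single constructor call does the same thing faster, sketched here with a toy stand-in for the Series of per-row flags that fct produces:

```python
import pandas as pd

keywords = ['fraud', 'money', 'secret']

# toy stand-in for s = dframe.Email.apply(fct, kwds=keywords)
s = pd.Series([[True, True, False],
               [False, True, False]])

# one DataFrame constructor call instead of one pd.Series call per row
expanded = pd.DataFrame(s.tolist(), index=s.index, columns=keywords)
print(expanded)
```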

Result:

result = dframe.drop(columns='Email')
#    fraud  money  secret
# 0   True   True   False                                    
# 1  False   True   False                                     
# 2  False  False    True                                    
# 3   True  False   False                                     
# 4  False  False   False              

result.sum()
# fraud     2
# money     2                                           
# secret    1                                           
# dtype: int64                         
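If you want to get back the 'keyword count' strings that the question builds in its output list, the column sums shown above can be formatted directly (a small sketch using the boolean matrix from the result above):

```python
import pandas as pd

# the boolean keyword matrix from the result above
result = pd.DataFrame({'fraud':  [True, False, False, True, False],
                       'money':  [True, True, False, False, False],
                       'secret': [False, False, True, False, False]})

# same strings as the question's `output` list
output = ['{} {}'.format(k, c) for k, c in result.sum().items()]
print(output)  # ['fraud 2', 'money 2', 'secret 1']
```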
SpghttCd
  • I have tested your function and it has reduced time by approximately 50%. Thank you very much for help ! – Typek May 27 '19 at 13:32