
I have a dataframe like this: input dataframe

Now I want to normalize the strings in the 'comments' column to the word 'election'. I tried using fuzzywuzzy but wasn't able to apply it to a pandas dataframe to partially match the word 'election'. The output dataframe should have the word 'election' in the 'comments' column, like this: output dataframe

Assume that I have around 100k rows and that there can be many possible variations of the word 'election'. Kindly guide me on this part.

nOObda
  • if you can install it, check out [pyenchant](https://github.com/rfk/pyenchant) for spell checking (that seems more your issue here; see the sketch below) – Ben.T Apr 26 '18 at 14:44
  • @Ben.T can you check the answer I posted and give any suggestions to reduce the runtime? – nOObda Apr 28 '18 at 17:11
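
A minimal sketch of the spell-checking idea from the comment above, assuming pyenchant is installed and an English dictionary is available (whether 'election'-like words appear among the suggestions depends on the dictionary):

import enchant

d = enchant.Dict("en_US")
word = 'electionsss'
if not d.check(word):        # check() is True only for correctly spelled words
    print(d.suggest(word))   # candidate corrections, e.g. ['elections', ...]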

3 Answers


With the answer you gave, you can use the pandas apply, stack and groupby functions to accelerate your code. You have input such as:

import pandas as pd
from fuzzywuzzy import fuzz
df = pd.DataFrame({'Merchant details': ['Alpha co','Bravo co'],
                   'Comments':['electionsss are around',
                               'vote in eelecttions']}) 

For the column 'Comments', you can create a temporary MultiIndex DataFrame containing one word per row by splitting and using the stack function:

df_temp = pd.DataFrame(
    {'split_comments':df['Comments'].str.split(' ',expand=True).stack()})

Then you create the column with the corrected words (following your idea), using apply and a comparison with fuzz.ratio:

df_temp['corrected_comments'] = df_temp['split_comments'].apply(
    lambda wd: 'election' if fuzz.ratio(wd, 'election') > 75 else wd)

Finally, you write the corrected data back into the Comments column of df using the groupby and join functions:

df['Comments'] = df_temp.reset_index().groupby('level_0').apply(
    lambda wd: ' '.join(wd['corrected_comments']))
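
If this runs as intended, the Comments column ends up normalized roughly like this (a sketch of the expected output, assuming the sample df above and the default fuzz.ratio scorer):

print(df['Comments'])
# 0    election are around
# 1       vote in election
# Name: Comments, dtype: object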
Ben.T

Don't operate on the dataframe; the overhead will kill you. Turn the column into a list, then iterate over that, and finally assign that list back to the column.
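
A minimal sketch of that idea, reusing the df and fuzz from the first answer (the single target word 'election' and the 75 threshold are assumptions for illustration):

from fuzzywuzzy import fuzz

comments = df['Comments'].tolist()   # pull the column out as a plain Python list
corrected = []
for text in comments:
    # replace any word that is a close fuzzy match for 'election'
    words = ['election' if fuzz.ratio(w, 'election') > 75 else w
             for w in text.split()]
    corrected.append(' '.join(words))
df['Comments'] = corrected           # assign the list back to the column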

Batman

OK, I tried this myself and came up with this code:

word = ['election']  # assumed list of target words to normalize against

for i in range(len(df)):
    a = df.comments[i].split()
    for j in word:
        for k in range(len(a)):
            if fuzz.ratio(j, a[k]) > 75:
                a[k] = j
    df.loc[i, 'comments'] = ' '.join(a)

But this approach seems slow for a large dataframe. Can someone suggest a faster, more Pythonic way of implementing this?

nOObda
  • Hi, I'm working on a similar problem. Were you able to find any better optimisation techniques for this task, as I have a huge dataset to work on? – Akhilesh Pothuri Nov 11 '20 at 05:32
  • Hi Akhilesh, leverage the methodology mentioned in the first answer; that should help. – nOObda Nov 11 '20 at 15:57