I have a pandas DataFrame with two relevant columns. I need to compare each value in column a (a list of names) against the other values in that same column and, if two or more of them are similar enough, sum the values in column b for those rows. To measure similarity I'm using the FuzzyWuzzy package, whose fuzz.ratio takes two strings and returns a score from 0 to 100.
Data:
a b
apple 3
orang 4
aple 1
orange 10
banana 5
I want to be left with:
a b
apple 4
orang 14
banana 5
I have tried the following line, but I keep getting a KeyError:
df['b']=df.apply(lambda x: df.loc[fuzz.ratio(df.a,x.a)>=70,'b'].sum(), axis=1)
I also need to remove every row whose b value was added into another row, so that only one row per group of similar names remains.
Any thoughts on how to accomplish this?
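For reference, here is a minimal sketch of the behaviour I'm after, not a definitive implementation. It walks column a in order, keeps the first name seen in each group as that group's label, and then sums column b per label. The names similarity and merge_similar are hypothetical helpers made up for this example, and the default scorer is a standard-library difflib stand-in scaled to 0-100 so the snippet runs without FuzzyWuzzy installed. Passing scorer=fuzz.ratio should be a drop-in swap, though the exact scores may differ slightly.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(s1, s2):
    """Stand-in scorer (assumption): difflib ratio scaled to 0-100."""
    return round(100 * SequenceMatcher(None, s1, s2).ratio())


def merge_similar(df, threshold=70, scorer=similarity):
    """Greedily group similar names in column 'a' and sum column 'b'."""
    representatives = []  # first-seen name of each group
    labels = []           # group label assigned to every row, in row order
    for name in df["a"]:
        # Reuse the first existing group whose representative is close enough.
        match = next(
            (rep for rep in representatives if scorer(name, rep) >= threshold),
            None,
        )
        if match is None:
            # No close match: this name starts a new group.
            representatives.append(name)
            match = name
        labels.append(match)
    # Replace each name by its group label, then sum 'b' per group.
    # sort=False keeps the groups in order of first appearance.
    return (
        df.assign(a=labels)
        .groupby("a", sort=False, as_index=False)["b"]
        .sum()
    )


df = pd.DataFrame(
    {
        "a": ["apple", "orang", "aple", "orange", "banana"],
        "b": [3, 4, 1, 10, 5],
    }
)
result = merge_similar(df)
print(result)
```

One caveat with this greedy approach: each name is only compared with the first name of every existing group, so the result depends on row order, and two names that are similar to each other only through a third name may end up in different groups.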