2

I have a Pandas DataFrame with two relevant columns. I need to check column A (a list of names) against itself, and if two (or more) values are similar enough to each other, I sum the values in column B for those rows. To check similarity, I'm using the FuzzyWuzzy package that accepts two strings and returns a score.

Data:

a            b   
apple        3 
orang        4 
aple         1  
orange       10  
banana       5

I want to be left with:

a       b
apple   4
orang   14
banana  5

I have tried the following line, but I keep getting a KeyError

    df['b']=df.apply(lambda x: df.loc[fuzz.ratio(df.a,x.a)>=70,'b'].sum(), axis=1)

I would also need to remove all rows where column b was added into another row.

Any thoughts on how to accomplish this?

phltop
  • 23
  • 3

2 Answers2

0

Some parts here are best done with pandas, and some parts (e.g., a function applied to a cartesian product) can be done without it.

Overall, you can do this with:

import itertools
import numpy as np

alias = {l : r for l, r in itertools.product(df.a, df.a) if l < r and 
fuzz.ratio(l, r) > 70}
>>> df.b.groupby(df.a.replace(alias)).sum()
apple      4
banana     5
orange    14
Name: b, dtype: int64

The line

alias = {l : r for l, r in itertools.product(df.a, df.a) if l < r and 
fuzz.ratio(l, r) > 70}

creates a map alias, mapping words to their alias from a.

The line

df.b.groupby(df.a.replace(alias)).sum()

groups b by a translation using alias, and then sums.

Ami Tavory
  • 74,578
  • 11
  • 141
  • 185
0

I would map and groupby:

def get_similarity(df, ind, col):
    mapped = list(map(lambda x: fuzz.ratio(x, df[col].loc[ind]), df[col]))
    cond = (np.array(mapped) >= 70)
    label = df[col][cond].iloc[0]

    return label

And use like this:

df.groupby(lambda x: get_similarity(df, x, 'a'))['b'].sum()
patricio
  • 324
  • 1
  • 4