I want to compare each row in a column with every other row in the same column and remove the row if the fuzzy-match ratio is greater than 90, using Python. I tried removing duplicates, but some rows have the same content plus some extra information. The data looks like this:

print(df)

Output is :

    Page no
0   Hello
2   Hey
3   Helloo
4   Heyy
5   Hellooo

I'm trying to compare each row with every other row and remove a row if its content matches another row with a ratio greater than 90 using fuzzy matching. The expected output is:

    Page no
0   Hello
2   Hey

The code I tried is:

def func(name):
    matches = df.apply(lambda row: (fuzz.ratio(row['Page no'], name) >= 90), axis=1)
    print(matches)
    return [i for i, x in enumerate(matches) if x]

func("Hey")

The code above only checks the rows against one sentence, "Hey".

Can anyone please help me with the code? It would be really helpful.

  • What library are you using for "fuzzy" matching / synonyms? Your example is just startswith; is that what you want? You need to define your algorithm... – Rob Raymond Jul 02 '21 at 16:53
  • I've updated the question with the code I tried, but it only works for one row and the sentence has to be typed in. Can you please look at the question? – Sunny Reddy Jul 02 '21 at 17:18

1 Answer

  • Use itertools.combinations to get all pairwise combinations of values
  • Then apply() fuzz.ratio() to each pair
  • Analyse the results and select rows that don't have a strong match to another row

import pandas as pd
import io
import itertools
from fuzzywuzzy import fuzz

df = pd.read_csv(
    io.StringIO(
        """    Page_no
0   Hello
2   Hey
3   Helloo
4   Heyy
5   Hellooo"""
    ),
    sep=r"\s+",
)

# find combinations that have greater than 80 match
dfx = pd.DataFrame(itertools.combinations(df["Page_no"].values, 2)).assign(
    ratio=lambda d: d.apply(lambda t: fuzz.ratio(t[0], t[1]), axis=1)
).loc[lambda d: d["ratio"].gt(80)]

# exclude rows that have big match to another row...
df.loc[~df["Page_no"].isin(dfx[1])]

Output:

    Page_no
0   Hello
2   Hey
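
The same "keep a row only if it has no strong match" idea can also be sketched as a greedy pass that keeps the first occurrence and drops later near-duplicates. This is only an illustrative sketch, not part of the answer's code: the names similarity and drop_near_duplicates are hypothetical, and the standard library's difflib.SequenceMatcher is used as a stand-in for fuzz.ratio, so scores may differ slightly from fuzzywuzzy's.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (assumed stand-in for fuzz.ratio)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def drop_near_duplicates(values, threshold=80):
    """Keep a value only if it scores at most `threshold` against every value kept so far."""
    kept = []
    for value in values:
        if all(similarity(value, k) <= threshold for k in kept):
            kept.append(value)
    return kept


pages = ["Hello", "Hey", "Helloo", "Heyy", "Hellooo"]
result = drop_near_duplicates(pages)
print(result)
```

With the default threshold of 80 this keeps "Hello" and "Hey". With a threshold of 90, "Heyy" would survive, because "Hey" against "Heyy" scores only about 86; that is why 80 is needed to reproduce the expected output. The column values can be passed in as df["Page_no"].tolist().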
Rob Raymond