Drop similar text rows of one column in Python

Question

import pandas as pd
from difflib import SequenceMatcher

df = pd.DataFrame({"id":[9,12,13,14],
                   "text":["Error number 609 at line 10", "Error number 609 at line 22", "Error string 'foo' at line 11", "Error string 'bar' at line 14"]})

Output:

   id                           text
0   9    Error number 609 at line 10
1  12    Error number 609 at line 22
2  13  Error string 'foo' at line 11
3  14  Error string 'bar' at line 14

I want to use difflib.SequenceMatcher to remove similarity score lower than 80 rows and only keep one.

a = "Error number 609 at line 10"
b = "Error number 609 at line 22"
c = "Error string 'foo' at line 11"
d = "Error string 'bar' at line 14"
print(SequenceMatcher(None, a, b).ratio()*100) #92.5925925925926
print(SequenceMatcher(None, b, c).ratio()*100) #60.71428571428571
print(SequenceMatcher(None, c, d).ratio()*100) #86.20689655172413
print(SequenceMatcher(None, a, c).ratio()*100) #64.28571428571429

How can I get expected result as follows in Python? You can use difflib or other python packages. Thank you.

   id                           text
0   9    Error number 609 at line 10
2  13   Error string 'foo' at line 11

yes, drop duplicates by similarity score based on `text` column. — ah bon, Jan 07 '20 at 08:45
you have a text column, so you want to compare each pair of them? It's a C_n^2 problem? which element you want to keep? The logic seems unclear — Travis, Jan 07 '20 at 09:02

jezrael · Accepted Answer · 2020-01-07T09:30:16.387

2

You can use:

#cross join with filter onl text column
df = df.assign(a=1).merge(df[['text']].assign(a=1), on='a')
#filter out same columns per rows
df = df[df['text_x'] != df['text_y']]
#sort columns per rows
df[['text_x','text_y']] = pd.DataFrame(np.sort(df[['text_x','text_y']],axis=1), index=df.index)
#remove duplicates
df = df.drop_duplicates(subset=['text_x','text_y'])
#get similarity
df['r'] = df.apply(lambda x: SequenceMatcher(None, x.text_x, x.text_y).ratio(), axis=1)
#filtering
df = df[df['r'] > 0.8].drop(['a','r'], axis=1)
print (df)
    id                         text_x                         text_y
1    9    Error number 609 at line 10    Error number 609 at line 22
11  13  Error string 'bar' at line 14  Error string 'foo' at line 11

edited Jan 07 '20 at 09:30

answered Jan 07 '20 at 09:02

jezrael

822,522
95
1,334
1,252

1

Sorry my real data have more columns, I get many `_x` and `_y` columns after run `df = df.assign(a=1).merge(df.assign(a=1), on='a')`, idea to avoid this from happening? – ah bon Jan 07 '20 at 09:10
@ahbon - One thing - do you need both `text_x` or `texxt_y` or only one of them, not important if `text_x` or `text_y`? – jezrael Jan 07 '20 at 09:22
I get `'text_x', 'sSource_x', 'sDate_x', 'sAbstract_x', 'a', 'text_y', 'sSource_y', 'sDate_y', 'sAbstract_y'`, but in fact, I think only `'text_x', 'text_y'` are necessary. – ah bon Jan 07 '20 at 09:30
Another question, I can keep either `text_x` or `text_y`, or it's better to keep `text_x`? – ah bon Jan 07 '20 at 09:47
1

@ahbon - I think `text_x` column is similar `text_y`, so one column should be enough. – jezrael Jan 07 '20 at 09:50
Does this line `df = df[df['r'] > 0.8].drop(['a','r'], axis=1)` means drop columns `a` and `r` and also drop rows which `r`'s values <= `0.8`? – ah bon Jan 07 '20 at 10:02

Drop similar text rows of one column in Python

1 Answers1

Linked