1
import pandas as pd
from difflib import SequenceMatcher

df = pd.DataFrame({"id":[9,12,13,14],
                   "text":["Error number 609 at line 10", "Error number 609 at line 22", "Error string 'foo' at line 11", "Error string 'bar' at line 14"]})

Output:

   id                           text
0   9    Error number 609 at line 10
1  12    Error number 609 at line 22
2  13  Error string 'foo' at line 11
3  14  Error string 'bar' at line 14

I want to use difflib.SequenceMatcher to remove similarity score lower than 80 rows and only keep one.

a = "Error number 609 at line 10"
b = "Error number 609 at line 22"
c = "Error string 'foo' at line 11"
d = "Error string 'bar' at line 14"
print(SequenceMatcher(None, a, b).ratio()*100) #92.5925925925926
print(SequenceMatcher(None, b, c).ratio()*100) #60.71428571428571
print(SequenceMatcher(None, c, d).ratio()*100) #86.20689655172413
print(SequenceMatcher(None, a, c).ratio()*100) #64.28571428571429

How can I get expected result as follows in Python? You can use difflib or other python packages. Thank you.

   id                           text
0   9    Error number 609 at line 10
2  13   Error string 'foo' at line 11
ah bon
  • 9,293
  • 12
  • 65
  • 148

1 Answers1

2

You can use:

#cross join with filter onl text column
df = df.assign(a=1).merge(df[['text']].assign(a=1), on='a')
#filter out same columns per rows
df = df[df['text_x'] != df['text_y']]
#sort columns per rows
df[['text_x','text_y']] = pd.DataFrame(np.sort(df[['text_x','text_y']],axis=1), index=df.index)
#remove duplicates
df = df.drop_duplicates(subset=['text_x','text_y'])
#get similarity
df['r'] = df.apply(lambda x: SequenceMatcher(None, x.text_x, x.text_y).ratio(), axis=1)
#filtering
df = df[df['r'] > 0.8].drop(['a','r'], axis=1)
print (df)
    id                         text_x                         text_y
1    9    Error number 609 at line 10    Error number 609 at line 22
11  13  Error string 'bar' at line 14  Error string 'foo' at line 11
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 1
    Sorry my real data have more columns, I get many `_x` and `_y` columns after run `df = df.assign(a=1).merge(df.assign(a=1), on='a')`, idea to avoid this from happening? – ah bon Jan 07 '20 at 09:10
  • @ahbon - One thing - do you need both `text_x` or `texxt_y` or only one of them, not important if `text_x` or `text_y`? – jezrael Jan 07 '20 at 09:22
  • I get `'text_x', 'sSource_x', 'sDate_x', 'sAbstract_x', 'a', 'text_y', 'sSource_y', 'sDate_y', 'sAbstract_y'`, but in fact, I think only `'text_x', 'text_y'` are necessary. – ah bon Jan 07 '20 at 09:30
  • Another question, I can keep either `text_x` or `text_y`, or it's better to keep `text_x`? – ah bon Jan 07 '20 at 09:47
  • 1
    @ahbon - I think `text_x` column is similar `text_y`, so one column should be enough. – jezrael Jan 07 '20 at 09:50
  • Does this line `df = df[df['r'] > 0.8].drop(['a','r'], axis=1)` means drop columns `a` and `r` and also drop rows which `r`'s values <= `0.8`? – ah bon Jan 07 '20 at 10:02