
I want to use the fuzzywuzzy package on the following table:

x   Reference   amount
121 TOR1234        500
121 T0R1234        500
121 W7QWER         500
121 W1QWER         500
141 TRYCATC        700
141 TRYCATC        700
151 I678MKV        300
151 1678MKV        300
  1. Group the table where the columns 'x' and 'amount' match.
  2. For each reference in a group, compare it (with fuzzywuzzy) against the other references in that group:
     a. where the match is 100%, delete those rows;
     b. where the match is 90-99.99%, keep them;
     c. delete anything below a 90% match for that particular row.

The expected output:
 x  Reference  amount
151 I678MKV 300
151 1678MKV 300
121 TOR1234 500
121 T0R1234 500
121 W7QWER  500
121 W1QWER  500

This is to detect fraud entries: as in the table above, '1' is replaced by 'I' and '0' is replaced by 'O'. If you have any alternative solution, please suggest it.
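
For reference, something along these lines is how I read the rules above with fuzzywuzzy's fuzz.ratio (an untested sketch; note that a single-character swap on codes this short scores only about 83-86, so the 90 cut-off may need loosening):

import pandas as pd
from fuzzywuzzy import fuzz

df = pd.DataFrame({"x": [121, 121, 121, 121, 141, 141, 151, 151],
                   "Reference": ["TOR1234", "T0R1234", "W7QWER", "W1QWER",
                                 "TRYCATC", "TRYCATC", "I678MKV", "1678MKV"],
                   "amount": [500, 500, 500, 500, 700, 700, 300, 300]})

def keep_near_duplicates(group, lower=90):
    keep = []
    for i, ref in group["Reference"].items():
        others = group.loc[group.index != i, "Reference"]
        scores = [fuzz.ratio(ref, other) for other in others]
        # keep a row only if its best match in the group is close but
        # not identical (rules b/c); exact duplicates drop out (rule a)
        keep.append(any(lower <= s < 100 for s in scores))
    return group[keep]

result = df.groupby(["x", "amount"], group_keys=False).apply(keep_near_duplicates)
print(result)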

1 Answer

From what I understand, you don't need the fuzzywuzzy package for this; a simple drop_duplicates with keep=False is enough.

import pandas as pd

# sample data from the question
df = pd.DataFrame(data={"x": [121, 121, 121, 121, 141, 141, 151, 151],
                        "Refrence": ["TOR1234", "T0R1234", "W7QWER", "W1QWER",
                                     "TRYCATC", "TRYCATC", "I678MKV", "1678MKV"],
                        "amount": [500, 500, 500, 500, 700, 700, 300, 300]})

res = df.drop_duplicates(['x', 'Refrence', 'amount'], keep=False).sort_values(['x'], ascending=[False])
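
With keep=False every row that has an exact duplicate on ('x', 'Refrence', 'amount') is dropped, so the two identical TRYCATC rows, i.e. the 100% matches, disappear entirely: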

print(res)
     x Refrence  amount
6  151  I678MKV     300
7  151  1678MKV     300
0  121  TOR1234     500
1  121  T0R1234     500
2  121   W7QWER     500
3  121   W1QWER     500

Now apply the Levenshtein distance to the Refrence values within the same x:

from itertools import combinations
from similarity.damerau import Damerau

# Damerau-Levenshtein distance from the python-string-similarity package
levenshtein = Damerau()

# all pairs of the remaining references
data = list(combinations(res['Refrence'], 2))
refrence_df = pd.DataFrame(data, columns=['Refrence', 'Refrence2'])

# attach the x of each reference in the pair
refrence_df = pd.merge(refrence_df, df[['x', 'Refrence']], on=['Refrence'], how='left')
refrence_df = pd.merge(refrence_df, df[['x', 'Refrence']], left_on=['Refrence2'],
                       right_on=['Refrence'], how='left')
refrence_df.rename(columns={'x_x': 'x_1', 'x_y': 'x_2', 'Refrence_x': 'Refrence'}, inplace=True)
refrence_df.drop(['Refrence_y'], axis=1, inplace=True)

# keep only pairs that belong to the same x
refrence_df = refrence_df[refrence_df['x_1'] == refrence_df['x_2']]

# number of edits needed to turn one reference into the other
refrence_df['edit_required'] = refrence_df.apply(
    lambda row: levenshtein.distance(row['Refrence'], row['Refrence2']), axis=1)

# characters present in one reference but not the other
refrence_df['characters_not_common'] = refrence_df.apply(
    lambda row: list(set(row['Refrence']) - set(row['Refrence2'])), axis=1)
print(refrence_df)
    Refrence Refrence2  x_1  x_2  edit_required characters_not_common
0   I678MKV   1678MKV  151  151              1                   [I]
9   TOR1234   T0R1234  121  121              1                   [O]
10  TOR1234    W7QWER  121  121              7    [O, T, 1, 3, 2, 4]
11  TOR1234    W1QWER  121  121              7       [O, T, 3, 2, 4]
12  T0R1234    W7QWER  121  121              7    [T, 1, 0, 3, 2, 4]
13  T0R1234    W1QWER  121  121              7       [T, 0, 3, 2, 4]
14   W7QWER    W1QWER  121  121              1                   [7]
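
One way to turn this into the flagging step (left as an exercise in the comments below) is to treat pairs that are a single edit apart as the suspected fraud entries; the single-edit threshold is an illustrative choice, not part of the original code:

# assumption: one edit (e.g. O -> 0 or I -> 1) marks a suspect pair
suspect_pairs = refrence_df[refrence_df['edit_required'] == 1]
print(suspect_pairs[['x_1', 'Refrence', 'Refrence2', 'characters_not_common']])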
  • This is to detect fraud entries. Like in the tables, '1' is replaced by 'I' and '0' is replaced by 'O'. If you have any alternative solution, please suggest. – ROHIT MUNDHRA Jul 25 '19 at 06:45
  • There is a thing called `levenshtein distance`: after dropping the duplicates you can calculate the Levenshtein distance on Refrence within the same x, which tells you how many edits are needed to make two references equal. – tawab_shakeel Jul 25 '19 at 06:48
  • https://github.com/luozhouyang/python-string-similarity#normalized-levenshtein is one such library, plus there are a lot of related options you can research. – tawab_shakeel Jul 25 '19 at 06:49
  • Levenshtein distance - I tried making this work on the problem, but was not able to. If you could help me with the implementation, that'd be great. – ROHIT MUNDHRA Jul 25 '19 at 06:53
  • @ROHITMUNDHRA I have applied the `levenshtein distance` and found out which characters are not common between two Refrence values. Now try yourself to replace those strings within the Refrence. – tawab_shakeel Jul 25 '19 at 07:09