I have 2 datasets, using data from df1 I want to identify duplicate data in df2 using 4 conditions.
- Conditions:
If a row of df1 'Name' column matches more than 80% with any row of 'Name' column in df2
(AND)
(df1['Class'] == df2['Class'] (OR) df1['Amt $'] == df2['Amt $'])
(AND)
If row of 'Category' column in df1 matches more than 80% with any row item of 'Category' column in df2
- Outcome:
if all conditions are met then keep only the new data in df2 and delete the other rows.
df1
Name Class Amt $ Category
Apple 1 5 Fruit
Banana 2 8 Fruit
Cat 3 4 Animal
df2
Index Name Class Amt $ Category
1 Apple is Red 1 5 Fruit
2 Banana 2 8 fruits
3 Cat is cute 3 4 animals
4 Green Apple 1 5 fruis
5 Banana is Yellow 2 8 fruet
6 Cat 3 4 anemal
7 Apple 1 5 anemal
8 Ripe Banana 2 8 frut
9 Royal Gala Apple 1 5 Fruit
10 Cats 3 4 animol
11 Green Banana 2 8 Fruit
12 Green Apple 1 5 fruits
13 White Cat 3 4 Animal
14 Banana is sweet 2 8 appel
15 Apple is Red 1 5 fruits
16 Ginger Cat 3 4 fruits
17 Cat house 3 4 animals
18 Royal Gala Apple 1 5 fret
19 Banana is Yellow 2 8 fruit market
20 Cat is cute 3 4 anemal
- Code I tried:
for i in df1['Name']:
for u in df2['Name']:
for k in df1['Class']:
for l in df2['Class']:
for m in df1['Amt $']:
for n in df2['Amt $']:
for o in df1['Category']:
for p in df2['Category']:
if SequenceMatcher(None, i, u).ratio() > .8 and k == l and m == n and SequenceMatcher(None, o, p).ratio() > 0.8:
print(i, u)
Desired output dataframe should like something like this:
Name Class Amt $ Category
Apple is Red 1 5 Fruit
Banana 2 8 fruits
Cat is cute 3 4 animals
Green Apple 1 5 fruis
Banana is Yellow 2 8 fruet
Cat 3 4 anemal
Ripe Banana 2 8 frut
Royal Gala Apple 1 5 Fruit
Cats 3 4 animol
Green Banana 2 8 Fruit
Green Apple 1 5 fruits
White Cat 3 4 Animal
Apple is Red 1 5 fruits
Cat house 3 4 animals
Banana is Yellow 2 8 fruit market
Cat is cute 3 4 anemal
Please help me with the best solution! :)