-2

I have 2 datasets, using data from df1 I want to identify duplicate data in df2 using 4 conditions.

  • Conditions:

If a row of df1 'Name' column matches more than 80% with any row of 'Name' column in df2

(AND)

(df1['Class'] == df2['Class'] (OR) df1['Amt $'] == df2['Amt $'])

(AND)

If row of 'Category' column in df1 matches more than 80% with any row item of 'Category' column in df2

  • Outcome:

if all conditions are met then keep only the new data in df2 and delete the other rows.

df1

Name    Class   Amt $   Category
Apple      1    5       Fruit
Banana     2    8       Fruit
Cat        3    4       Animal

df2

Index   Name              Class Amt $   Category
    1   Apple is Red       1    5       Fruit
    2   Banana             2    8       fruits
    3   Cat is cute        3    4       animals
    4   Green Apple        1    5       fruis
    5   Banana is Yellow   2    8       fruet
    6   Cat                3    4       anemal
    7   Apple              1    5       anemal
    8   Ripe Banana        2    8       frut
    9   Royal Gala Apple   1    5       Fruit
    10  Cats               3    4       animol
    11  Green Banana       2    8       Fruit
    12  Green Apple        1    5       fruits
    13  White Cat          3    4       Animal
    14  Banana is sweet    2    8       appel
    15  Apple is Red       1    5       fruits
    16  Ginger Cat         3    4       fruits
    17  Cat house          3    4       animals
    18  Royal Gala Apple   1    5       fret
    19  Banana is Yellow   2    8       fruit market
    20  Cat is cute        3    4       anemal

  • Code I tried:

for i in df1['Name']:
    for u in df2['Name']:
        for k in df1['Class']:
            for l in df2['Class']:
                for m in df1['Amt $']:
                    for n in df2['Amt $']:
                        for o in df1['Category']:
                            for p in df2['Category']:
                                if SequenceMatcher(None, i, u).ratio() > .8 and k == l and m == n and SequenceMatcher(None, o, p).ratio() > 0.8:
                                    print(i, u)

Desired output dataframe should like something like this:

Name              Class Amt $   Category
Apple is Red        1   5       Fruit
Banana              2   8       fruits
Cat is cute         3   4       animals
Green Apple         1   5       fruis
Banana is Yellow    2   8       fruet
Cat                 3   4       anemal
Ripe Banana         2   8       frut
Royal Gala Apple    1   5       Fruit
Cats                3   4       animol
Green Banana        2   8       Fruit
Green Apple         1   5       fruits
White Cat           3   4       Animal
Apple is Red        1   5       fruits
Cat house           3   4       animals
Banana is Yellow    2   8       fruit market
Cat is cute         3   4       anemal

Please help me with the best solution! :)

  • And what should be the output for df1 and df2 examples? –  Mar 12 '20 at 01:08
  • a new dataframe showing the rows from df2 where conditions are met – Vin Bolisetti Mar 12 '20 at 01:18
  • Show us an example of what that new dataframe looks like. – David Collins Mar 12 '20 at 01:19
  • What is the issue, exactly? Have you tried anything, done any research? Stack Overflow is not a free code writing service. See: [ask], [help/on-topic], https://meta.stackoverflow.com/questions/261592/how-much-research-effort-is-expected-of-stack-overflow-users. – AMC Mar 12 '20 at 01:21
  • @AMC I tried for few hours with no luck, I just updated the question with the code I tried, which is not giving me the desired output – Vin Bolisetti Mar 12 '20 at 01:32
  • @VinBolisetti You could at the very least share the data in a format which is convenient for others to use. – AMC Mar 12 '20 at 01:34
  • @AMC I am so sorry, I don't stackoverflow very well, I am trying my best, I which I can make it as easy as possible, I will try – Vin Bolisetti Mar 12 '20 at 01:38
  • I have copy pasted dataframes in a block of code, plz suggest if there is a better way, thanks – Vin Bolisetti Mar 12 '20 at 01:43
  • @VinBolisetti Don’t worry, no one is born knowing everything :) You can read [ask] if you need more information. As for the data, methods like `DataFrame.to_records()` and `DataFrame.to_dict()` are good ways of producing output which can easily be shared and used to recreate the DataFrame. – AMC Mar 12 '20 at 01:43
  • @DavidCollins Hi David, could you be able to suggest any solution? – Vin Bolisetti Mar 12 '20 at 03:22
  • @VinBolisetti Based on your expected result you are saying 'Apple' and 'Apple is red' are not 80% match. `SequenceMatcher(None, 'Apple', 'Apple is Red').ratio()` gives only 0.5882352941176471. Are you expecting anything else here? or is the expected result not right? – davidbilla Mar 12 '20 at 07:05

1 Answers1

1

First you have to iterate through your both dfs and match using the conditions and set a variable in df2.

df2['match'] = False
for idx2, row2 in df2.iterrows():
    match = False
    for idx1, row1 in df1.iterrows():
        if (SequenceMatcher(None, row1['Name'], row2['Name']).ratio())>=0.8 and \
                (SequenceMatcher(None, row1['Category'], row2['Category']).ratio())>=0.8 and \
                (row1['Class'] == row2['Class'] or row1['Amt $'] == row2['Amt $']):
            match = True
            break
    df2.at[idx2, 'match'] = match

Once you have the matches, then you remove the duplicates from the ones that are matches df2['match']==True.

df2[df2['match']==True].drop_duplicates(keep='first')

Next you can join the above result with the non-matches df2['match']==False

df2[df2['match']==False].append(df2[df2['match']==True].drop_duplicates(keep='first'))

Here I am assuming you want to remove direct duplicates. Do you want to remove the duplicates based on the conditions or direct duplicates?

Based on the test data set you have here 'Apple' and 'Apple is red' are 80% match. But SequenceMatcher(None, 'Apple', 'Apple is Red').ratio() gives only 0.5882352941176471. Similarly SequenceMatcher(None, 'Fruit', 'fruits').ratio() is only 0.7272727272727273. Are you expecting anything else here? or is the expected result not right?

Anyway, I hope this gives you an idea on the approach.

EDIT 1 If you want to get the matching df1['Name'].

I have only reset df2['match'] as a string instead of boolean and assigned df1['Name'] to df2['match'] instead of assigning it to True. Then in the final df I am concatenating the df2 rows that has df2['match']==False and the non duplicate rows of df2['match']==True. Hope this helps.

df2['match'] = ''
for idx2, row2 in df2.iterrows():
    match = ''
    for idx1, row1 in df1.iterrows():
        if (SequenceMatcher(None, row1['Name'], row2['Name']).ratio())>=0.5 and \
                (SequenceMatcher(None, row1['Category'], row2['Category']).ratio())>=0.5 and \
                (row1['Class'] == row2['Class'] or row1['Amt $'] == row2['Amt $']):
            match = row1['Name']
            break
    df2.at[idx2, 'match'] = match


print(df2[df2['match']==''].append(df2[df2['match']!=''].drop_duplicates(keep='first')))
davidbilla
  • 2,120
  • 1
  • 15
  • 26
  • That is fantastic! thank you David, I want it to be marked as True if there is a logical match even if the spelling is wrong, for example cat is cute (True because there is cat) and category anemal (True because it is matching with df1 Animal and is spelled incorrectly) and for df2 Row 14 Banana is sweet (True because banana) and Category is appel (False, it should be a fruit/frut/fruits/fruit market because we are matching word with df1 Banana & Fruit) therefore Row 14 is a False, I hope I am not confusing you. – Vin Bolisetti Mar 12 '20 at 23:39
  • But this is fantastic, I have df1 220 rows data and df2 1.4Million rows data, this is for a de-duplication project, for which I am applying the logic you provided and I noticed positive results, it taking over an hour to run the script, thank you so much for your help :) – Vin Bolisetti Mar 12 '20 at 23:42
  • Glad I was able to help. You could reduce your sequence match percentage to get what you are expecting – davidbilla Mar 13 '20 at 02:45
  • Because we are iterting both dfs one within the other it will definitely take time. There may be a better solution which is optimal. – davidbilla Mar 13 '20 at 03:35
  • Sure thanks, is there a way to index the matched duplicates in df2? – Vin Bolisetti Mar 14 '20 at 06:10
  • @VinBolisetti can you explain what you're expecting? – davidbilla Mar 14 '20 at 18:48
  • if a row from df1 found a match in df2 then we are tagging as TRUE, instead of TRUE can we show the associated df1['Name']? so basically row name of df1 next to the matched duplicate in df2. – Vin Bolisetti Mar 16 '20 at 04:30
  • @VinBolisetti Please see EDIT1. Hope this matches what you are expecting. – davidbilla Mar 16 '20 at 07:41
  • thanks David! I ran the script and validated, that's exactly what is required :) – Vin Bolisetti Mar 16 '20 at 10:36
  • Done :) thanks for all your help David! Happy Coding!! – Vin Bolisetti Mar 17 '20 at 06:12