
I.e. I want to drop duplicate pairs, using col1 and col2 as the subset, but only if the values in col3 are opposites (one negative and one positive). This is similar to the drop_duplicates function, but I want to impose a condition and only remove the first pair (i.e. if there are 3 duplicates, remove just 2 and leave 1).

My dataset (df):


    col1    col2    col3
0   1        1       1
1   2        2       2
2   1        1       1
3   3        5       7
4   1        2      -1
5   1        2       1
6   1        2       1

I want:


        col1    col2    col3
    0   1        1        1
    1   2        2        2
    2   1        1        1
    3   3        5        7
    6   1        2        1

Rows 4 and 5 are duplicated in col1 and col2 and their values in col3 are opposites, therefore we remove both. Rows 0 and 2 have duplicate values in col1 and col2 but the same value in col3, so we don't remove those rows.

I've tried using drop_duplicates but realised it won't work, as it simply removes all duplicates and doesn't consider anything else.

bbaba

4 Answers


We can do this with transform:

out = df[df.groupby(['col1','col2']).col3.transform('sum').ne(0) & df.col3.ne(0)]
Out[252]: 
   col1  col2  col3
0     1     1     1
1     2     2     2
2     1     1     1
3     3     5     7
BENY
  • Unfortunately the code fails when I add a 6th row with values 1, 2, 1 (the same as the 5th row): it doesn't seem to remove any row. It should still remove the 4th and 5th rows. – bbaba Nov 21 '22 at 04:20
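The failure mode described in this comment is easy to reproduce: with a group like (-1, 1, 1), the group sum is 1, so the .ne(0) filter keeps all three rows instead of cancelling the pair. A minimal sketch of the check:

```python
import pandas as pd

# minimal group that defeats the sum-based filter: -1 + 1 + 1 == 1
df = pd.DataFrame({'col1': [1, 1, 1], 'col2': [2, 2, 2], 'col3': [-1, 1, 1]})

# the nonzero group sum makes .ne(0) True for every row,
# so nothing is dropped even though two rows form an opposite pair
out = df[df.groupby(['col1', 'col2']).col3.transform('sum').ne(0) & df.col3.ne(0)]
print(len(out))  # all three rows survive
```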

Recreating the dataset:

import pandas as pd

data = [
    [1, 1,  1],
    [2, 2,  2],
    [1, 1,  1],
    [3, 5,  7],
    [1, 2, -1],
    [1, 2,  1],
    [1, 2,  1],
]

df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])

If your data is not massive, you can use iterrows on a subset of the data.
The subset contains all rows that are duplicated once every value has been made absolute.
Next, we check whether col3 is negative and whether the row's opposite is in the duplicate subset.
If so, we drop the row from df.

# rows that are duplicated once col3's sign is ignored
df_dupes = df[df.abs().duplicated(keep=False)]
df_dupes_list = df_dupes.to_numpy().tolist()
for i, row in df_dupes.iterrows():
    # drop the negative row if its positive counterpart is also a duplicate
    if row.col3 < 0 and [row.col1, row.col2, -row.col3] in df_dupes_list:
        df.drop(labels=i, axis=0, inplace=True)

This code should remove row 4.
In your desired output, you left row 5 for some reason.
If you can explain why you left row 5 but kept row 0, then I can adjust my code to more accurately match your desired output.

Petar Luketina
  • Rows 4 and 5 should both be removed, as they have the same values in col1 and col2 and their col3 values are opposites; they make up a pair. – bbaba Nov 21 '22 at 07:42
  • @Petar I kept row 0 because col1 and col2 matched but col3 wasn't the opposite. – bbaba Nov 21 '22 at 07:54

I used @Petar Luketina's code with an adjustment and it worked. However, I would like to use it on a massive dataset (about 1 million rows and 43 columns), and this code takes forever:

import numpy as np

df_dupes = df[df['col3'].abs().duplicated(keep=False)]
df_dupes_list = df_dupes.to_numpy().tolist()
for i, row in df_dupes.iterrows():
    if row.col3 < 0 and [row.col1, row.col2, -row.col3] in df_dupes_list:
        try:
            # locate the first positive counterpart and drop both rows
            c = np.where((df['col1'] == row.col1) & (df['col2'] == row.col2) &
                         (df['col3'] == -row.col3))[0][0]
            df.drop(labels=[i, df.index.values[c]], axis=0, inplace=True)
        except IndexError:
            pass
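For a frame that size, the pairwise cancellation can be done without a Python-level loop. Below is a hedged sketch (cancel_opposite_pairs is a hypothetical helper, not from any answer above): it assumes a "pair" means exact negatives in col3, and it cancels as many opposite-sign pairs as possible within each (col1, col2, |col3|) group, which reproduces the desired output on the sample data:

```python
import pandas as pd

def cancel_opposite_pairs(df):
    # rows pair up when col1, col2 and |col3| all match but the signs differ
    mag = df['col3'].abs().rename('mag')
    sign = (df['col3'] > 0).rename('sign')
    keys = [df['col1'], df['col2'], mag]
    # position of each row within its (keys, sign) subgroup
    rank = df.groupby(keys + [sign]).cumcount()
    # how many opposite-signed partners share the same keys
    grp = df.groupby(keys)['col3']
    n_pos = grp.transform(lambda s: (s > 0).sum())
    n_neg = grp.transform(lambda s: (s < 0).sum())
    partners = n_neg.where(sign, n_pos)
    # a row is cancelled while unmatched opposite partners remain for it
    return df[rank >= partners]
```

On the question's 7-row sample this keeps rows 0, 1, 2, 3 and 6.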
bbaba

I know this is an old question, but for anyone interested, here is an alternative that avoids iterating over the rows.

First use a flag to identify the pairs of rows to be removed (a row and its preceding row, where col1 and col2 are the same and the col3 values are negatives of each other):

df.loc[(df.col1 == df.col1.shift(1)) & (df.col2 == df.col2.shift(1)) & (df.col3 == -df.col3.shift(1)), 'removeFlag'] = True
df.loc[df.removeFlag.shift(-1) == True, 'removeFlag'] = True

   col1  col2  col3 removeFlag
0     1     1     1        NaN
1     2     2     2        NaN
2     1     1     1        NaN
3     3     5     7        NaN
4     1     2    -1       True
5     1     2     1       True
6     1     2     1        NaN

Then use this flag to delete the offending rows:

df = df[~(df.removeFlag == True)]
df.drop(columns=['removeFlag'], inplace=True)

   col1  col2  col3
0     1     1     1
1     2     2     2
2     1     1     1
3     3     5     7
6     1     2     1

This approach would probably need a little more refinement if row 6 had been the same as row 4 (i.e. the first half of a repeated identical pair), but you get the idea.
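For reference, the two flagging steps can be condensed into one boolean mask, which avoids the helper column and the inplace operations. A sketch of the same adjacent-pair idea on the question's sample data (with the same row-6 caveat):

```python
import pandas as pd

data = [[1, 1, 1], [2, 2, 2], [1, 1, 1], [3, 5, 7], [1, 2, -1], [1, 2, 1], [1, 2, 1]]
df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])

# True on the second row of each adjacent pair where col1/col2 match
# and the col3 values are exact negatives of each other
second = (df.col1 == df.col1.shift(1)) & (df.col2 == df.col2.shift(1)) & (df.col3 == -df.col3.shift(1))
# widen the mask to also cover the first row of each pair, then drop both
df = df[~(second | second.shift(-1, fill_value=False))]
```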

Mike