0

I have dataFrame and I need to drop duplicates per group ('col1') based on a minimum value in another column 'abs(col1 - col2)', but I need to change this condition for the last group by taking the max value in 'abs(col1 - col2)' that corresponding to the last group in 'col1' where I sorted the 'col1' with ascending order. (to behave as a loop)

Update 1 :

I need to assign the last group dynamically.

for example, If I have a data frame as

  • creating DataFrame

df = pd.DataFrame( {'col0':['A','A','A','A','A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B','B','B','B','B'],'col1':[1,1,1,2,2,2,3,3,3,4,4,4,2,2,2,3,3,3,4,4,4,5,5,5], 'col2':[2,3,4,1,3,4,1,2,4,1,2,3,3,4,5,2,4,5,2,3,5,2,3,4]})

compute Diff column (this column will be used as a condition)

df['abs(col1 - col2)']=abs(df['col1']-df['col2'])

  • The original Df as follow :

enter image description here

  • The desired Df should looks like:

enter image description here

  • my trial:

    df.sort_values(by=['col0','col1','abs(col1 - col2)','col2'],ascending=[True,True,True,False]).drop_duplicates(['col0','col1'])

  • the resulting as follow:

enter image description here

Ahmad Senousi
  • 613
  • 2
  • 12
  • 24
Sidhom
  • 935
  • 1
  • 8
  • 15

2 Answers2

2

Updated:

If I understand correctly, you want each a different sort order each time you reach a maximum on col1.

  1. Extract the groups where sorting is different:
df.groupby(['col0'], as_index=False)['col1'].max()
  1. Deduplicate df as you did
  2. Deduplicate just the groups found in step 1 with the correct sorting. You can get these groups by merging with the original df:
pd.merge(df, col1_max_groups)
  1. Update the deduplicated DataFrame with the new values

Full example:

col1_max_groups = df.groupby(['col0'], as_index=False)['col1'].max()
deduped = df.sort_values(['col0', 'col1', 'abs(col1 - col2)', 'col2'], 
                         ascending=[True, True, True, False]) \
    .drop_duplicates(['col0', 'col1']) \
    .set_index(['col0', 'col1'])
update = pd.merge(df, col1_max_groups) \
    .sort_values(['col0', 'col1', 'abs(col1 - col2)', 'col2'], 
                 ascending=[True, True, False, False]) \
    .drop_duplicates(['col0', 'col1'])
deduped.update(update.set_index(['col0', 'col1']))
deduped.reset_index()

# returns
# col0  col1  col2  abs(col1 - col2)
#    A     1     2                 1
#    A     2     3                 1
#    A     3     4                 1
#    A     4     1                 3
#    B     2     3                 1
#    B     3     4                 1
#    B     4     5                 1
#    B     5     2                 3
onepan
  • 946
  • 5
  • 8
1

If this particular result is what you want to have you can split the dataframe and use two different rules, and then concat them again. As example:

import pandas as pd

df = pd.DataFrame( {'col1':[1,1,1,2,2,2,3,3,3,4,4,4], 'col2':[2,3,4,1,3,4,1,2,4,1,2,3]})

df['abs(col1 - col2)']=abs(df['col1']-df['col2'])

df = df.sort_values(by=['col1','abs(col1 - col2)','col2'],ascending=[True,True,False]).drop_duplicates('col1')
df1 = df.loc[df['col1'] != 4]

df2 = df.loc[df['col1'] == 4]
df2 = df2.sort_values(by=['col1','abs(col1 - col2)','col2'],ascending=[True,True,False])
df2Last = df2.tail(1)

df = pd.concat([df1, df2Last])

result:

col1  col2  abs(col1 - col2)
 1     2                 1
 2     3                 1
 3     4                 1
 4     1                 3
cccnrc
  • 1,195
  • 11
  • 27
  • Thanks for your answer. I don't need to assign ```4``` directly. If I have another column considered as a group column as illustrated in update 1. then I need to obtain the desired ```DF``` dynamically. – Sidhom Mar 28 '19 at 04:53