removing duplicate rows in pandas with multiple conditions

Question

I have

ID  Field1  Field2  Field3  Field4
1    A          1     6       F
2    A          0     1       F
3    B          2     4       F
4    B          3     8       F

I want

ID  Field1  Field2  Field3  Field4
1    A          1     1       F
4    B          3     4       F

I want to have 1 row per distinct 'Field1'. There are over 100 fields and rows are duplicated due to 'Field2' and 'Field3' (all other fields are the same like Field4).

In my final df I want the max of 'Field2' and min of 'Field3' to be returned.

Adding a comment to help distinguish between this question and the dup target. In this question, OP asked for all columns back but may not have anticipated that they needed to be aggregated. They couldn't have gathered that missing piece from dup target. Other than that, I agree with the duping. Enough so, I won't hammer it open again. — piRSquared, Apr 06 '18 at 00:47

score 3 · Accepted Answer · answered Apr 06 '18 at 00:08

Use agg in a groupby context

df.groupby('Field1', as_index=False).agg(
    {'Field2': 'max', 'Field3': 'min'}
)

  Field1  Field2  Field3
0      A       1       1
1      B       3       4

Mind you, you can't also magically include 'ID' and 'Field4' because it could come from any of the rows within the group. You'd need another aggregation function that enables you to choose those fields as well.

I'll use 'first' for 'ID' and 'last' for 'Field4' for demonstration purposes.

df.groupby('Field1', as_index=False).agg(
    {'ID': 'first', 'Field2': 'max', 'Field3': 'min', 'Field4': 'last'}
).reindex(columns=df.columns)

   ID Field1  Field2  Field3 Field4
0   1      A       1       1      F
1   3      B       3       4      F

removing duplicate rows in pandas with multiple conditions

1 Answers1