How to remove duplicates and keep values of all columns

Question

I have a df like below

  Date         ID Colour ColourCode    Item ItemCode
0 2020-01-02  245   Blue         Bl   Apple      NaN
1 2020-01-02  245   Blue        NaN   Apple       Ap
2 2020-01-03  245   Blue         Bl  Orange      NaN
3 2020-01-03  245  Green        NaN   Apple       Ap
4 2020-01-05  472    Red         Re   Grape      NaN
5 2020-01-05  472    Red        NaN   Grape       Gr
6 2020-01-07  472    Red         Re  Banana      NaN

df.to_dict()
{'Date': {0: Timestamp('2020-01-02 00:00:00'), 1: Timestamp('2020-01-02 00:00:00'), 2: Timestamp('2020-01-03 00:00:00'), 3: Timestamp('2020-01-03 00:00:00'), 4: Timestamp('2020-01-05 00:00:00'), 5: Timestamp('2020-01-05 00:00:00'), 6: Timestamp('2020-01-07 00:00:00')}, 'ID': {0: 245, 1: 245, 2: 245, 3: 245, 4: 472, 5: 472, 6: 472}, 'Colour': {0: 'Blue', 1: 'Blue', 2: 'Blue', 3: 'Green', 4: 'Red', 5: 'Red', 6: 'Red'}, 'ColourCode': {0: 'Bl', 1: nan, 2: 'Bl', 3: nan, 4: 'Re', 5: nan, 6: 'Re'}, 'Item': {0: 'Apple', 1: 'Apple', 2: 'Orange', 3: 'Apple', 4: 'Grape', 5: 'Grape', 6: 'Banana'}, 'ItemCode': {0: nan, 1: 'Ap', 2: nan, 3: 'Ap', 4: nan, 5: 'Gr', 6: nan}}

Any rows having same Date, ID, Colour, Item would be considered duplicates. I just want to keep one row but this row would contain values of both columns ColourCode and ItemCode.
My expected result looks like

 Date          ID Colour ColourCode Item ItemCode
0 2020-01-02  245   Blue         Bl   Apple       Ap
1 2020-01-03  245   Blue         Bl  Orange      NaN
2 2020-01-03  245  Green        NaN   Apple       Ap
3 2020-01-05  472    Red         Re   Grape       Gr
4 2020-01-07  472    Red         Re  Banana      NaN

Are there any ways of doing it in Python?

"I just want to keep one row but this row would contain values of both columns ColourCode and ItemCode. " I can't understand. Does this mean you tried something already and got a wrong result? We can only help fix things that are actually shown to us. — Karl Knechtel, Oct 09 '22 at 06:00
@KarlKnechtel That's what I want to achieve but I didn't know how to do about it. — Shichimi, Oct 09 '22 at 23:59

score 1 · Answer 1 · answered Oct 09 '22 at 06:33

Assuming pandas and using fillna, groupby, and aggregate:

df = df.fillna("")
aggregation_functions = {'ColourCode': 'sum', 'ItemCode': 'sum'}
df = df.groupby(['Date','ID','Colour', 'Item'], as_index=False).aggregate(aggregation_functions)

Output:

        Date        ID   Colour  Item      ColourCode ItemCode                                              
     0  01/02/2020  245   Blue   Apple         Bl       Ap                     
     1  01/03/2020  245   Blue  Orange         Bl                                                                            
     2  01/03/2020  245  Green   Apple                  Ap                                                                   
     3  01/05/2020  472    Red   Grape         Re       Gr                                                                   
     4  01/07/2020  472    Red  Banana         Re

Did not know we can use sum of strings like that. Works well for me. Many thanks! — Shichimi, Oct 13 '22 at 22:33

score 0 · Accepted Answer · answered Oct 09 '22 at 06:23

Assuming pandas, you can use groupby.first and reindex:

out = (df
 .groupby(['Date', 'ID', 'Colour', 'Item'], as_index=False)
 .first()[df.columns]
 )

Output:

        Date   ID Colour ColourCode    Item ItemCode
0 2020-01-02  245   Blue         Bl   Apple       Ap
1 2020-01-03  245   Blue         Bl  Orange     None
2 2020-01-03  245  Green       None   Apple       Ap
3 2020-01-05  472    Red         Re   Grape       Gr
4 2020-01-07  472    Red         Re  Banana     None

Completely solve my problem! Thanks @mozway – Shichimi Oct 13 '22 at 22:31 — Shichimi, Oct 13 '22 at 22:31

How to remove duplicates and keep values of all columns

2 Answers2