Remove duplicate rows from a DataFrame while keeping one column as a list (Python)

Question

I have a dataframe like this:

file:
   | FIRST | LAST | ID |
   ---------------------
0  | "ABC" |   12 | 35 |
1  | "ABC" |   14 | 35 |
2  | "AB"  |   15 | 36 |

Now, what I want is:

file:
   | FIRST | LAST     | ID |
   -------------------------
0  | "ABC" | [12, 14] | 35 |
2  | "AB"  | 15       | 36 |

For this problem, assume that if two rows have the same ID, then all their values except LAST are also equal.

So rows that share an ID should be collapsed into a single row, with their LAST values collected into a list.

I tried the solution given in this question: Pandas DataFrame - Combining one column's values with same index into list

I used this:

file = file.groupby('ID')

file = file['Last'].unique()

This is the output I got:

ID
35    [12, 14]
36        [15]
Name: Last, dtype: object

Probably, I am missing something in the groupby().

Thanks in advance :)

UPDATE:

My original DataFrame has more than 100 columns. If two rows have the same ID, then all their values except LAST are also equal.

score 2 · Answer 1 · answered Aug 15 '17 at 22:49

Is this what you want?

df.groupby(['FIRST', 'ID']).LAST.apply(lambda x: x.tolist()).reset_index()

    FIRST   ID  LAST
0   AB      36  [15]
1   ABC     35  [12, 14]

Vaishali

My Dataframe has more than 100 columns so when I put ['First','ID'] in groupby, all the other columns won't be there. – Harinder Singh Aug 15 '17 at 22:53
In that case, even the other columns would need to be aggregated, do you want them in list as well? – Vaishali Aug 15 '17 at 22:56
As I mentioned in the question, if value of ID is equal for two rows then all the other column values are also equal, except LAST. – Harinder Singh Aug 15 '17 at 22:58
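Following up on the comments above: since every column except LAST is said to be identical within an ID, one option might be to group on all of those columns at once, so none of them has to be typed out or aggregated. This is only a sketch on the sample data from the question; the name `key_cols` is made up for illustration.

```python
import pandas as pd

# Sample data from the question; the real frame is said to have 100+ columns.
df = pd.DataFrame({
    "FIRST": ["ABC", "ABC", "AB"],
    "LAST": [12, 14, 15],
    "ID": [35, 35, 36],
})

# Group on every column except LAST, so no column has to be listed by hand.
key_cols = [col for col in df.columns if col != "LAST"]

# Collect the LAST values of each group into a list, keeping first-seen order.
result = (
    df.groupby(key_cols, sort=False)["LAST"]
      .apply(list)
      .reset_index()
)
print(result)
```

One caveat: by default groupby drops rows that have NaN in any key column, which matters more when there are 100+ keys. Passing `dropna=False` to groupby (pandas 1.1 or later) should keep them.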

Alexander · Accepted Answer · 2017-08-16T21:23:48.110

Given that only the `LAST` column differs between rows with the same ID, just take the first value of every other column when applying a groupby. For the column `LAST`, keep its value if the group has a single row, or convert it to a list of unique items if there is more than one.

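grouping_cols = ['ID', ...]  # key columns to group on
# Every non-key column keeps its first value within each group...
agg_cols = {col: 'first' for col in df if col not in grouping_cols}
# ...except LAST, which becomes a list when the group has more than one row.
agg_cols['LAST'] = lambda x: x.unique().tolist() if len(x) > 1 else x.iat[0]
>>> df.groupby(grouping_cols, as_index=False).agg(agg_cols)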
  ID      LAST FIRST
0  35  [12, 14]   ABC
1  36        15    AB

edited Aug 16 '17 at 21:23

answered Aug 15 '17 at 23:11

Alexander

what if i want to groupby() using multiple columns? – Harinder Singh Aug 16 '17 at 20:27
This is what i got: **ValueError: function does not reduce** – Harinder Singh Aug 16 '17 at 21:30
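Regarding the two comments above: as far as I can tell, pandas versions from that time raised this ValueError when a function passed to agg returned a list. A possible workaround, sketched below, is to build the lists with apply (as in the first answer) and take the remaining columns from drop_duplicates, which also handles several key columns. The choice of `keys` here is hypothetical; use whichever columns identify a duplicate in your data.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "FIRST": ["ABC", "ABC", "AB"],
    "LAST": [12, 14, 15],
    "ID": [35, 35, 36],
})

keys = ["ID", "FIRST"]  # hypothetical choice of key columns

# One list of LAST values per key combination, built with apply instead of agg.
lists = df.groupby(keys, sort=False)["LAST"].apply(list).reset_index()

# Keep the first row of each key combination (all other columns come along),
# drop its scalar LAST, and attach the list version instead.
result = (
    df.drop_duplicates(subset=keys)
      .drop(columns="LAST")
      .merge(lists, on=keys, how="left")
)

# Optional: unwrap single-item lists so that [15] becomes 15, as in the question.
result["LAST"] = result["LAST"].map(lambda v: v[0] if len(v) == 1 else v)
print(result)
```

Note that merge resets the index, so the original row labels (0 and 2 in the question) are not preserved here.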
