
Is there a way to groupby a DataFrame and then, for each of the remaining columns, collect all entries into a set (or a list filtered to unique values)? So given something like this

   Name        Date  Amount   purchase
0  Jack  2016-01-31      10    'apple'
1  Jack  2016-02-29       5        25
2  Jack  2016-02-29       8       'B+'
3  Jill  2016-01-31      10       'D2'
4  Jill  2016-02-29       5         E
4  Jill  2016-02-29       5         E

the output after grouping by the first two columns would be:

    Name        Date  Amount     purchase
 0  Jack  2016-01-31     [10]     [apple]
 1  Jack  2016-02-29    [5,8]   [25,'B+']
 3  Jill  2016-01-31     [10]      ['D2']
 4  Jill  2016-02-29      [5]       ['E']

I can do this for each column with `df.groupby(['Name', 'Date'])['Amount'].apply(set)` and then concatenate the results, but if the list of columns is long, is there a shorter, more elegant solution?
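For reference, the per-column approach described above can be stitched together with `pd.concat` (a sketch using the sample data from the question; the column list is assumed):

```python
import pandas as pd

# Sample data mirroring the question (last Jill row duplicated on purpose)
df = pd.DataFrame({
    'Name': ['Jack', 'Jack', 'Jack', 'Jill', 'Jill', 'Jill'],
    'Date': ['2016-01-31', '2016-02-29', '2016-02-29',
             '2016-01-31', '2016-02-29', '2016-02-29'],
    'Amount': [10, 5, 8, 10, 5, 5],
    'purchase': ['apple', 25, 'B+', 'D2', 'E', 'E'],
})

# Apply set per column, then concatenate the resulting Series side by side
grouped = df.groupby(['Name', 'Date'])
res = pd.concat(
    [grouped[col].apply(set) for col in ['Amount', 'purchase']],
    axis=1,
).reset_index()
print(res)
```

This works but repeats the groupby logic for every column, which is what the question is trying to avoid.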

2 Answers


Don't do this

Pandas was never designed to hold lists in series / columns. You can concoct expensive workarounds, but these are not recommended.

The main reason holding lists in series is not recommended is you lose the vectorised functionality which goes with using NumPy arrays held in contiguous memory blocks. Your series will be of object dtype, which represents a sequence of pointers, much like list.

Of course, object dtype is unavoidable with mixed types. But, here, the effect is compounded by a nested pointer structure. You will lose benefits in terms of memory and performance, as well as access to optimized Pandas methods.
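A minimal illustration of the point above: a column of lists is stored as `object` dtype, so element-wise work falls back to Python-level loops instead of vectorised NumPy operations.

```python
import pandas as pd

# A column of lists: object dtype, i.e. an array of pointers to Python lists
s = pd.Series([[10], [5, 8]])
print(s.dtype)  # object

# A plain numeric column supports fast, NumPy-backed arithmetic
nums = pd.Series([10, 5, 8])
print(nums * 2)

# With lists, even "add 1 to each element" requires a Python-level loop
print(s.apply(lambda lst: [v + 1 for v in lst]))
```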


A slightly better alternative

You can aggregate to strings, so that you have only one level of pointers:

res = df.groupby(['Name', 'Date'], as_index=False)[['Amount', 'purchase']]\
        .agg(lambda x: ', '.join(map(str, set(x))))

print(res)
    
   Name        Date Amount  purchase
0  Jack  2016-01-31     10   'apple'
1  Jack  2016-02-29   8, 5  'B+', 25
2  Jill  2016-01-31     10      'D2'
3  Jill  2016-02-29      5         E
  • Thank you for the warning, I would keep it in mind. The answer however resulted in just a column with the `['Amount', 'purchase']` in it. Could be that I am just not sure how to access them. – Todor Popov Sep 13 '18 at 06:38
  • @TodorPopov, Have updated with `str` conversion, works fine for me now. – jpp Sep 13 '18 at 08:21

You can use the `aggregate` function of `groupby`. You will also have to reset the index if you want the `Name` and `Date` levels of the resulting MultiIndex back as columns.

df_data = df.groupby(['Name', 'Date']).aggregate(lambda x: list(x)).reset_index()
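Applied to the sample data from the question, this produces list-valued cells; aggregating to a `set` instead de-duplicates within each group (a sketch, with the data reconstructed from the question):

```python
import pandas as pd

# Sample frame from the question (last row duplicated on purpose)
df = pd.DataFrame({
    'Name': ['Jack', 'Jack', 'Jack', 'Jill', 'Jill', 'Jill'],
    'Date': ['2016-01-31', '2016-02-29', '2016-02-29',
             '2016-01-31', '2016-02-29', '2016-02-29'],
    'Amount': [10, 5, 8, 10, 5, 5],
    'purchase': ['apple', 25, 'B+', 'D2', 'E', 'E'],
})

# Collect every remaining column into a list per group
df_data = df.groupby(['Name', 'Date']).aggregate(lambda x: list(x)).reset_index()
print(df_data)

# Use set instead of list to drop duplicates within each group
df_sets = df.groupby(['Name', 'Date']).aggregate(lambda x: set(x)).reset_index()
print(df_sets)
```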
  • 1
    Instead of resetting the index at the end, you can also use `df.groupby(['Name', 'Date'], as_index=False)...`. But that is not really shorter... – Graipher Sep 12 '18 at 16:07
  • Thanks, this led me to the idea of `agg(lambda x: set(x))`, which worked fine for me; I don't know why `list` was returning _Function does not reduce_. – Todor Popov Sep 13 '18 at 06:33
  • You can also go for a tuple instead of a set. – Raunaq Jain Sep 13 '18 at 06:46