Aggregate on multiple columns with one attribute at the agg function

Question

Let's suppose that I have a pandas dataFrame (data_stores) similar to the following:

store| item1 | item2 | item3
------------------------------
1    | 45    | 50    | 53  
1    | 200   | 300   | 250
2    | 20    | 17    | 21  
2    | 300   | 350   | 400

Let's say that I want to aggregate on column item1 with the mean and on columns item2 and item3 with the sum.

This could be commonly done in the following way:

data_stores_total= data_stores.groupby(['store'], as_index=False).agg({'item1': 'mean', 'item2': 'sum', 'item3': 'sum' })

However, this cannot be done (more efficiently) in the following way:

 data_stores_total= data_stores.groupby(['store'], as_index=False).agg({'item1': 'mean', ['item2', 'item3']: 'sum' })

neither in the following way which makes more sense for dictionary keys:

 data_stores_total= data_stores.groupby(['store'], as_index=False).agg({'mean': 'item1':, 'sum': ['item2', 'item3']})

Is there any way to do an aggregation with the same function on some columns of a dataframe without writing a new dictionary attribute at the agg function for each of them?

jezrael · Accepted Answer · 2019-01-16T14:43:36.253

2

It is not possible, only you can define dictionary with keys for functions and list for columns names, and then swap keys with values in loop:

data_stores = pd.DataFrame({'store': [1, 1, 2, 2], 
                           'item1': [45, 200, 20, 300], 
                           'item2': [50, 300, 17, 350], 
                           'item3': [53, 250, 21, 400]})
print (data_stores)
   store  item1  item2  item3
0      1     45     50     53
1      1    200    300    250
2      2     20     17     21
3      2    300    350    400


d = {'mean':'item1', 'sum' : ['item2', 'item3']}

out = {}
for k, v in d.items():
    if isinstance(v, list):
        for x in v:
            out[x] = k
    else:
        out[v] = k

print (out)
{'item1': 'mean', 'item2': 'sum', 'item3': 'sum'}

data_stores_total = data_stores.groupby('store', as_index=False).agg(out)
print (data_stores_total)
   store  item1  item2  item3
0      1  122.5    350    303
1      2  160.0    367    421

Or:

d = {'mean':['item1'], 'sum' : ['item2', 'item3']}

d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
print (d1)
{'item1': 'mean', 'item2': 'sum', 'item3': 'sum'}

data_stores_total = data_stores.groupby('store', as_index=False).agg(d1)
print (data_stores_total)
   store  item1  item2  item3
0      1  122.5    350    303
1      2  160.0    367    421

EDIT:

If want aggregate all columns without few by same aggregate function, you can create dictionary by all columns with filter out by list with difference and then add missing pairs key: value for column: aggregate function:

out = dict.fromkeys(data_stores.columns.difference(['store','item1']), 'sum')
out['item1'] = 'mean'
print (out)
{'item2': 'sum', 'item3': 'sum', 'item1': 'mean'}

data_stores_total = data_stores.groupby('store', as_index=False).agg(out)
print (data_stores_total)
   store  item2  item3  item1
0      1    350    303  122.5
1      2    367    421  160.0

You can also pass custom function working with this column:

def func(x):
    return x.sum() / x.mean()

out = dict.fromkeys(data_stores.columns.difference(['store','item1']), 'sum')
out['item1'] = func
print (out)
{'item2': 'sum', 'item3': 'sum', 'item1': <function func at 0x000000000F3950D0>}

data_stores_total = data_stores.groupby('store', as_index=False).agg(out)
print (data_stores_total)
   store  item2  item3  item1
0      1    350    303      2
1      2    367    421      2

edited Jan 16 '19 at 14:43

answered Jan 16 '19 at 13:09

jezrael

822,522
95
1,334
1,252

@PoeteMaudit - No, it is not possible. But is possible specify it in list and then swap values like in edited answer. – jezrael Jan 16 '19 at 13:23
Hm, it is a pity I guess that it is not possible on `pandas`; thank you for your answer though(upvote). The problem becomes even bigger when for example I have 10 columns and I want to use a function on the two of them while I want to use a different one to the rest of them. Would I have to write a new dict attribute for each of the 8 columns of the latter group (unless I go for your source code)? I thought `pandas` would have taken care of this obvious thing. – Outcast Jan 16 '19 at 13:28
@PoeteMaudit - yes, give me some time. – jezrael Jan 16 '19 at 13:28
Yes no worries. – Outcast Jan 16 '19 at 13:29
@PoeteMaudit - check edited answer - create dictionary with columns with filter by difference and then add key with value `mean`. – jezrael Jan 16 '19 at 13:47
Thanks again. I am not sure what the line with `difference` exactly does because I have not used `difference` before but essentially (if I am not wrong) you are creating a dict by using as keys the dataframe column names and as value for all of them the `sum` and then you simply replace the value of `item1` with `mean`. This is something but as you may guess if you have many columns perhaps it is not exactly better than simply writing a new attribute for each column at the agg function from the very beginning. I am in general surprised that `pandas` has not taken care of this. – Outcast Jan 16 '19 at 14:02
@PoeteMaudit - Yes, ir is not implemented yet, but you can create new enhacement issue in [pandas github](https://github.com/pandas-dev/pandas/issues) – jezrael Jan 16 '19 at 14:04
Ok cool. By the way, I would suggest you to use the `.columns` at your answers so that your answers are more generalised and reproducible than only specifically using a dataframe and/or dict for`store`, `item1` etc. – Outcast Jan 16 '19 at 14:05
@PoeteMaudit - Not sure if understand, but I try improve answer. – jezrael Jan 16 '19 at 14:10
No worries. I meant that for example at your first answer you write `d = {'mean':'item1', 'sum' : ['item2', 'item3']}` but it would be better to use `.columns` instead of the specific names of my dataframe so that your answer is more general. I think that your last edit was towards this direction. – Outcast Jan 16 '19 at 14:12
Can I make it a bit harder please? If you want to use a custom function instead of a standard one (`mean`, `sum`)? – Outcast Jan 16 '19 at 14:41

Aggregate on multiple columns with one attribute at the agg function

1 Answers1