Pandas - dataframe groupby - how to get sum of multiple columns

Question

This should be an easy one, but somehow I couldn't find a solution that works.

I have a pandas dataframe which looks like this:

index col1   col2   col3   col4   col5
0     a      c      1      2      f 
1     a      c      1      2      f
2     a      d      1      2      f
3     b      d      1      2      g
4     b      e      1      2      g
5     b      e      1      2      g

I want to group by col1 and col2 and get the sum() of col3 and col4. col5 can be dropped since the data can not be aggregated.

Here is what the output should look like. I am interested in having both col3 and col4 in the resulting dataframe. It doesn't really matter if col1 and col2 are part of the index or not.

index col1   col2   col3   col4   
0     a      c      2      4          
1     a      d      1      2      
2     b      d      1      2      
3     b      e      2      4

Here is what I tried:

df_new = df.groupby(['col1', 'col2'])['col3', 'col4'].sum()

That however only returns the aggregated results of col4.

I am lost here. Every example I found only aggregates one column, where the issue obviously doesn't occur.

The issue is likely that `df.col3.dtype` is likely not an `int` or a numeric datatype. Try `df.col3 = df.col3.astype(int)` before doing your `groupby` — A.Kot, Sep 26 '17 at 16:16

BENY · Accepted Answer · 2018-04-17T20:26:22.057

76

By using apply

df.groupby(['col1', 'col2'])["col3", "col4"].apply(lambda x : x.astype(int).sum())
Out[1257]: 
           col3  col4
col1 col2            
a    c        2     4
     d        1     2
b    d        1     2
     e        2     4

If you want to agg

df.groupby(['col1', 'col2']).agg({'col3':'sum','col4':'sum'})

edited Apr 17 '18 at 20:26

answered Sep 26 '17 at 16:14

BENY

317,841
20
164
234

1

Why is this rated so much higher than ```df.groupby([col1','col2'])[['col3','col4']].sum()``` – Justin Beresford Nov 14 '22 at 18:04
@JustinBeresford Here is what I tried: df_new = df.groupby(['col1', 'col2'])['col3', 'col4'].sum() That however only returns the aggregated results of col4. – BENY Nov 14 '22 at 18:35
The `apply(lambda)` soluction results in this warning for me: `Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.` – MikeB Mar 01 '23 at 02:19
1

@MikeB `["col3", "col4"]` should be `[["col3", "col4"]]` in order to select columns by passing a list. Without the inner square brackets, `"col3", "col4"` is interpreted as a tuple. @BENY would be great if you could edit the answer as it asks me for at least 6 character changes when trying to do so. – Ouroboroski Aug 22 '23 at 15:42

Prateek Sharma · Answer 2 · 2021-08-11T16:31:12.283

Another generic solution is

df.groupby(['col1','col2']).agg({'col3':'sum','col4':'sum'}).reset_index()

This will give you the required output.

UPDATED (June 2020): Introduced in Pandas 0.25.0, Pandas has added new groupby behavior “named aggregation” and tuples, for naming the output columns when applying multiple aggregation functions to specific columns.

df.groupby(['col1','col2']).agg(
     sum_col3 = ('col3','sum'),
     sum_col4 = ('col4','sum'),
     ).reset_index()

Also, you can name new columns, e.g. I've used 'sum_col3' and 'sum_col4', but you can use any name you want.

Refer to Link for detailed description.

score 10 · Answer 3 · answered Apr 23 '21 at 04:03

10

Due to pandas FutureWarning: Indexing with multiple keys discussed on GitHub and Stack Overflow, I recommend this solution:

df.groupby(['col1', 'col2'])[['col3', 'col4']].sum().reset_index()

Output:

answered Apr 23 '21 at 04:03

oil_lamp

482
7
9

Leo James · Answer 4 · 2018-09-27T00:41:44.527

6

The above answer didn't work for me.

df_new = df.groupby(['col1', 'col2']).sum()[["col3", "col4"]]

I was grouping by single group by and sum columns.

Here is the one worked for me.

D1.groupby(['col1'])['col2'].sum() << The sum at the end not the middle.

edited Sep 27 '18 at 00:41

answered Sep 26 '18 at 22:43

Leo James

151
1
3

Have you tried :df_new = df.groupby(['col1', 'col2'])[["col3", "col4"]].sum() ? This one worked for me. – marcogemaque Nov 04 '20 at 12:20

score 3 · Answer 5 · answered Sep 26 '17 at 16:17

The issue is likely that df.col3.dtype is likely not an int or a numeric datatype. Try df.col3 = df.col3.astype(int) before doing your groupby

Additionally, select your columns after the groupby to see if the columns are even being aggregated:

df_new = df.groupby(['col1', 'col2']).sum()[["col3", "col4"]]

score 2 · Answer 6 · answered Nov 27 '19 at 23:19

I think it would be more efficient to do the following:

df.groupby(['col1', 'col2']).agg({'col3':'sum','col4':'sum'}).sum(axis=1)

or:

df.groupby(['col1', 'col2'])['col3', 'col4'].sum().sum(axis=1)

This does assume you have appropriate types in the dataframe.

Mykola Zotko · Answer 7 · 2022-04-28T07:46:04.147

2

You miss a pair of square brackets to select necessary columns in your DataFrameGroupBy object:

df.groupby(['col1', 'col2'])[['col3', 'col4']].sum()

You can also drop columns you don't need before groupby:

df.drop('col5', axis=1).groupby(['col1', 'col2']).sum()

edited Apr 28 '22 at 07:46

answered Apr 28 '22 at 07:34

Mykola Zotko

15,583
3
71
73

Pandas - dataframe groupby - how to get sum of multiple columns

7 Answers7

Linked

Related