Groupby sum, index vs. column results

Question

For the following dataframe:

df = pd.DataFrame({'group':['a','a','b','b'], 'data':[5,10,100,30]},columns=['group', 'data']) 
print(df)

  group  data
0     a     5
1     a    10
2     b   100
3     b    30

When grouping by column, adding and creating a new column, the result is:

df['new'] = df.groupby('group')['data'].sum() 
print(df)

  group  data  new
0     a     5  NaN
1     a    10  NaN
2     b   100  NaN
3     b    30  NaN

However if we reset the df to the original data and move the group column to the index,

df.set_index('group', inplace=True)
print(df)

       data
group      
a         5
a        10
b       100
b        30

And then group and sum, then we get:

df['new'] = df.groupby('group')['data'].sum() 
print(df)

       data  new
group           
a         5   15
a        10   15
b       100  130
b        30  130

Why does the column group not set the values in the new column but the index grouping does set the values in the new column?

jezrael · Accepted Answer · 2019-05-17T09:06:57.993

Better here is use GroupBy.transform for return Series with same size like original DataFrame, so after assign all working correctly:

df['new'] = df.groupby('group')['data'].transform('sum')

Because if assign new Series values are align by index values. If index is different, get NaNs:

print (df.groupby('group')['data'].sum())
group
a     15
b    130
Name: data, dtype: int64

Different index values - get NaNs:

print (df.groupby('group')['data'].sum().index)
Index(['a', 'b'], dtype='object', name='group')

print (df.index)
RangeIndex(start=0, stop=4, step=1)

df.set_index('group', inplace=True)

print (df.groupby('group')['data'].sum())
group
a     15
b    130
Name: data, dtype: int64

Index can align, because values matched:

print (df.groupby('group')['data'].sum().index)
Index(['a', 'b'], dtype='object', name='group')

print (df.index)
Index(['a', 'a', 'b', 'b'], dtype='object', name='group')

yatu · Answer 2 · 2019-05-17T09:13:31.860

You're not getting what you want because when using df.groupby('group')['data'].sum(), this is returning an aggregated result with group as index:

group
a     15
b    130
Name: data, dtype: int64

Where clearly indexes are not aligned.

If you want this to work you'll have to use transform, which returns a Series with the transformed vales which has the same axis length as self:

df['new'] = df.groupby('group')['data'].transform('sum')

   group  data  new
0     a     5   15
1     a    10   15
2     b   100  130
3     b    30  130

Groupby sum, index vs. column results

2 Answers2

Linked