Python: Summarizing & Aggregating Groups and Sub-groups in DataFrame

Question

I am trying to build a table that has groups that are divided by subgroups with count and average for each subgroup. For example, I want to convert the following data frame:

To a table that looks like this where the interval is a bigger group and columns a thru i become subgroups within the group with the corresponding subgroups' count and average in each cell:

I have tried this with no success:

score 2 · Answer 1 · answered Apr 13 '19 at 09:10

2

Try.

df.groupby(['interval']).apply(lambda x : x.stack()
                                           .groupby(level=-1)
                                           .agg({'count', 'mean'}))

Use groupby with apply to apply a function for each group then stack and groupby again with agg to find count and mean.

answered Apr 13 '19 at 09:10

ResidentSleeper

2,385
2
10
20

I get an error message saying **TypeError: count() takes 1 positional argument but 2 were given** – user9532692 Apr 13 '19 at 17:17
@user9532692 It worked fine for me. I'm not sure what happened. Can you try to use only `mean` or replace `count` with `size`? – ResidentSleeper Apr 13 '19 at 17:51
I am still getting an error message saying **DataError: No numeric types to aggregate**. I wondered if it's because the data values are not in integer, but running *type(hh.iloc[1,1])* returns *numpy.int64* – user9532692 Apr 13 '19 at 22:01

jezrael · Accepted Answer · 2019-04-14T08:36:25.093

1

Use DataFrame.melt with GroupBy.agg and tuples for aggregate functions with new columns names:

df1 = (df.melt('interval', var_name='source')
         .groupby(['interval','source'])['value']
         .agg([('cnt','count'), ('average','mean')])
         .reset_index())
print (df1.head())
  interval source  cnt  average
0        0      a    1      5.0
1        0      b    1      0.0
2        0      c    1      0.0
3        0      d    1      0.0
4        0      f    1      0.0

edited Apr 14 '19 at 08:36

answered Apr 14 '19 at 08:16

jezrael

822,522
95
1,334
1,252

The result looks like it's counting items that fall into each interval only. How can you make it to count items that fall into both _interval_ and _source_? – user9532692 Apr 14 '19 at 22:11
@user9532692 - I am confused with output, can you explain row `6~9 a 2 7` from output? why `2`? why `7` ? – jezrael Apr 15 '19 at 07:08
2 is the count every item that falls in the interval 6~9 **AND** source a. 7 is the average of those two items that falls in the same group (interval 6~9 AND where source is a) – user9532692 Apr 20 '19 at 00:01

score 0 · Answer 3 · answered Apr 20 '19 at 00:16

The following code solves the problem I asked for:

df.group(['interval'],,as_index=False).agg({
                                                'a':{"count":"mean"},
                                                'b':{"count":"mean"},
                                                'c':{"count":"mean"},
                                                'd':{"count":"mean"},
                                                'f':{"count":"mean"},
                                                'g':{"count":"mean"},
                                                'i':{"count":"mean"}
                                            })

Python: Summarizing & Aggregating Groups and Sub-groups in DataFrame

3 Answers3

Linked