2

I am grouping a dataframe by 2 columns and i aggregate by the sum of the other columns. How I can have a total by the first grouped column in the same data frame?

for example my data frame is:

np.random.seed(0)
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
               'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
               'C' : np.random.randn(8),
               'D' : np.random.randn(8)})

The result of:

grouped = df.groupby(by=['A', 'B']).sum()

is:

                  C         D
A   B                        
bar one    0.400157  0.410599
    three  2.240893  1.454274
    two   -0.977278  0.121675
foo one    2.714141  0.340644
    three -0.151357  0.333674
    two    2.846296  0.905081

I what to get:

                  C         D
A   B                        
bar one    0.400157  0.410599
    two   -0.977278  0.121675
    three  2.240893  1.454274
    total  1.663773  1.986547
foo one    2.714141  0.340644
    two    2.846296  0.905081
    three -0.151357  0.333674
    total  5.409080  1.579400

how it can be done?

UPDATE: I found a similar question at Pandas groupby and sum total of group It has 2 more answer for this question.

yona bendelac
  • 114
  • 2
  • 10

1 Answers1

2

You can get clever with pd.Categorical to create a placeholder for "total" in the groupby output. This'll make it easy to compute and assign the total back to the result.

df.B = pd.Categorical(
         df.B, categories=np.append(df.B.unique(), 'total'))
v = df.groupby(by=['A', 'B']).sum()
v.loc(axis=0)[pd.IndexSlice[:,'total']] = v.groupby(level=0).sum().values

print(v)
                  C         D
A   B                        
bar one    0.400157  0.410599
    two   -0.977278  0.121675
    three  2.240893  1.454274
    total  1.663773  1.986547
foo one    2.714141  0.340644
    two    2.846296  0.905081
    three -0.151357  0.333674
    total  5.409080  1.579400

If you need to aggregate on different metrics:

df.B = pd.Categorical(
         df.B, categories=np.append(df.B.unique(), 'total'))
idx = pd.MultiIndex.from_product([df.A.unique(), df.B.cat.categories]) 

v = df.groupby(by=['A', 'B']).agg(['sum', 'count']).reindex(idx)
v.loc(axis=0)[pd.IndexSlice[:,'total']] = v.groupby(level=0, sort=False).sum().values

print(v)
                  C               D      
                sum count       sum count
foo one    2.714141   2.0  0.340644   2.0
    two    2.846296   2.0  0.905081   2.0
    three -0.151357   1.0  0.333674   1.0
    total  5.409080   5.0  1.579400   5.0
bar one    0.400157   1.0  0.410599   1.0
    two   -0.977278   1.0  0.121675   1.0
    three  2.240893   1.0  1.454274   1.0
    total  1.663773   3.0  1.986547   3.0

Another alternative is pivot_table which makes margin generation easier (although does not provide sub-level margins):

df.pivot_table(index=['A', 'B'], 
               values=['C', 'D'], 
               aggfunc=['sum', 'count'], 
               margins=True)

                sum           count     
                  C         D     C    D
A   B                                   
bar one    0.400157  0.410599   1.0  1.0
    two   -0.977278  0.121675   1.0  1.0
    three  2.240893  1.454274   1.0  1.0
foo one    2.714141  0.340644   2.0  2.0
    two    2.846296  0.905081   2.0  2.0
    three -0.151357  0.333674   1.0  1.0
All        7.072852  3.565947   8.0  8.0
yona bendelac
  • 114
  • 2
  • 10
cs95
  • 379,657
  • 97
  • 704
  • 746
  • 1
    Hi coldspeed, I think this a great answer that solves the problem of OP. I'm just wondering: are there easier ways to get subtotals (in all forms). Coming up with this answer takes quite some pandas experience, while the question itself is for analysts and reporters quite a common one. – Sander van den Oord Dec 11 '18 at 09:03
  • It is good and work for one aggregate data but not for two or more (like I need) it work for: df.groupby(by=['A', 'B']).agg('count') But not for: df.groupby(by=['A', 'B']).agg(['count','sum']) and it is not working for df.groupby(by=['A', 'B']).agg(['count']) too (because the bracket??) – yona bendelac Dec 11 '18 at 09:08
  • @SandervandenOord Most of the time, user requirements for margins are easily satisfied by functions like crosstab and pivot_table (see edit). Subtotals from groupby are rarely requested, at least, from what I've seen in my time here. – cs95 Dec 11 '18 at 09:35
  • Thanks very much @coldspeed ! – Sander van den Oord Dec 11 '18 at 09:42
  • @coldspeed the part of "aggregate on different metrics" is what I need but it have an issue. "groupby", sort level 0 when idx is not sorted. so when you reindex you un-sort the result but "v.groupby(level=0).sum().values" sort the result. at the result you put in "total bar" the sum of "total foo" and vice versa. So you need to change the last line to "v.groupby(level=0, sort=False).sum()" or maybe better, sort idx (not sure how) and It probably will work for 2 level of index but not sure for 3 level and more (I have only 2 level so it will be good, I still need to check it) – yona bendelac Dec 11 '18 at 11:34
  • @yonabendelac Feel free to make the necessary edits to my answer as you see fit; I will approve your edit. – cs95 Dec 11 '18 at 12:03
  • I don't like the "other metrics" part, as it will work only with additive aggreg function. It would not work on mean() for example, as you can't sum the mean(). Nad if you decide the second part to add the mean() an not sum(), you can't neither average the mean() – el Josso Mar 24 '22 at 10:31