Pythonic / Panda Way to Create Function to Groupby

Question

I am fairly new to programming & am looking for a more pythonic way to implement some code. Here is dummy data:

 df = pd.DataFrame({
'Category':np.random.choice( ['Group A','Group B'], 10000),
'Sub-Category':np.random.choice( ['X','Y','Z'], 10000),
'Sub-Category-2':np.random.choice( ['G','F','I'], 10000),
'Product':np.random.choice( ['Product 1','Product 2','Product 3'], 10000),
'Units_Sold':np.random.randint(1,100, size=(10000)),
'Dollars_Sold':np.random.randint(100,1000, size=10000), 
'Customer':np.random.choice(pd.util.testing.rands_array(10,25,dtype='str'),10000),
'Date':np.random.choice( pd.date_range('1/1/2016','12/31/2018',  
                  freq='D'), 10000)})

I have lots of transactional data like that that I perform various Groupby's on. My current solution is to make a master groupby like this:

master = df.groupby(['Customer','Category','Sub-Category','Product',pd.Grouper(key='Date',freq='A')])['Units_Sold'].sum()\
.unstack()

From there, I perform various groupbys using .groupby(level=) function to aggregate the information in the way I'm looking for. I usually make a summary at each level. In addition, I create sub-totals at each level using some variation of the below code.

y = master.groupby(level=[0,1,2]).sum()
y.index = pd.MultiIndex.from_arrays([
    y.index.get_level_values(0),
    y.index.get_level_values(1),
    y.index.get_level_values(2) + ' Total',
    len(y.index)*['']
])

y1 = master.groupby(level=[0,1]).sum()
y1.index = pd.MultiIndex.from_arrays([
    y1.index.get_level_values(0),
    y1.index.get_level_values(1)+ ' Total',
    len(y1.index)*[''],
    len(y1.index)*['']
])

y2 = master.groupby(level=[0]).sum()
y2.index = pd.MultiIndex.from_arrays([
    y2.index.get_level_values(0)+ ' Total',
    len(y2.index)*[''],
    len(y2.index)*[''],
    len(y2.index)*['']
])

pd.concat([master,y,y1,y2]).sort_index()\
    .assign(Diff = lambda x: x.iloc[:,-1] - x.iloc[:,-2])\
    .assign(Diff_Perc = lambda x: (x.iloc[:,-2] / x.iloc[:,-3])- 1)\
    .dropna(how='all')\

This is just an example - I may perform the same exercise, but perform the groupby in a different order. For example - next I may want to group by 'Category', 'Product', then 'Customer', so I'd have to do: master.groupby(level=[1,3,0).sum()

Then I will have to repeat the whole exercise for sub-totals like above. I also frequently change the time period - could be year-ending a specific month, could be year to date, could be by quarter, etc.

From what I've learned so far in programming (which is minimal, clearly!), you should look to write a function any time you repeat code. Obviously I am repeating code over & over again in this example.

Is there a way to construct a function where you can provide the levels to Groupby, along with the time frame, all while creating a function for sub-totaling each level as well?

Thanks in advance for any guidance on this. It is very much appreciated.

Is your `y2` really what you mean, or should it have `level=[0]`? — DSM, Sep 06 '18 at 16:08

Parfait · Accepted Answer · 2018-09-06T19:28:29.193

For a DRY-er solution, consider generalizing your current method into a defined module that filters original data frame by date ranges and runs aggregations, receiving the group_by levels and date ranges (latter being optional) as passed in parameters:

Method

def multiple_agg(mylevels, start_date='2016-01-01', end_date='2018-12-31'):

    filter_df = df[df['Date'].between(start_date, end_date)]

    master = (filter_df.groupby(['Customer', 'Category', 'Sub-Category', 'Product', 
                     pd.Grouper(key='Date',freq='A')])['Units_Sold']
                .sum()
                .unstack()
              )

    y = master.groupby(level=mylevels[:-1]).sum()
    y.index = pd.MultiIndex.from_arrays([
        y.index.get_level_values(0),
        y.index.get_level_values(1),
        y.index.get_level_values(2) + ' Total',
        len(y.index)*['']
    ])

    y1 = master.groupby(level=mylevels[0:2]).sum()
    y1.index = pd.MultiIndex.from_arrays([
        y1.index.get_level_values(0),
        y1.index.get_level_values(1)+ ' Total',
        len(y1.index)*[''],
        len(y1.index)*['']
    ])

    y2 = master.groupby(level=mylevels[0]).sum()
    y2.index = pd.MultiIndex.from_arrays([
        y2.index.get_level_values(0)+ ' Total',
        len(y2.index)*[''],
        len(y2.index)*[''],
        len(y2.index)*['']
    ])

    final_df = (pd.concat([master,y,y1,y2])
                         .sort_index()
                         .assign(Diff = lambda x: x.iloc[:,-1] - x.iloc[:,-2])
                         .assign(Diff_Perc = lambda x: (x.iloc[:,-2] / x.iloc[:,-3])- 1)
                         .dropna(how='all')
                         .reorder_levels(mylevels)
                )

    return final_df

Aggregation Runs (of different levels and date ranges)

agg_df1 = multiple_agg([0,1,2,3])

agg_df2 = multiple_agg([1,3,0,2], '2016-01-01', '2017-12-31')

agg_df3 = multiple_agg([2,3,1,0], start_date='2017-01-01', end_date='2018-12-31')

Testing (final_df being OP'S pd.concat() output)

# EQUALITY TESTING OF FIRST 10 ROWS
print(final_df.head(10).eq(agg_df1.head(10)))

# Date                                        2016-12-31 00:00:00  2017-12-31 00:00:00  2018-12-31 00:00:00  Diff  Diff_Perc
# Customer   Category Sub-Category Product                                                                                  
# 45mhn4PU1O Group A  X            Product 1                 True                 True                 True  True       True
#                                  Product 2                 True                 True                 True  True       True
#                                  Product 3                 True                 True                 True  True       True
#                     X Total                                True                 True                 True  True       True
#                     Y            Product 1                 True                 True                 True  True       True
#                                  Product 2                 True                 True                 True  True       True
#                                  Product 3                 True                 True                 True  True       True
#                     Y Total                                True                 True                 True  True       True
#                     Z            Product 1                 True                 True                 True  True       True
#                                  Product 2                 True                 True                 True  True       True

Thank you so much for your help! I'm trying to better understand the function - when I run the different aggregations, agg_df1 returns what I'd expect, including the sub-totals. When I run aagg_df2 & agg_df3 it doesn't return what I'd expect - the order stays the same as in agg_df1 even though you're passing different levels? Also, the subtotals part is missing.Finally, I can't seem to get your Testing bit to work. Thanks so much for your help! — keg5038, Sep 06 '18 at 17:40
The other two agg_dfs are demonstrating how you can change parameters as you indicated you wanted. So no, they will not match anything you posted. Adjust to your specific needs. You need to assign your `pd.concat()` output to a *final_df* variable. And be sure to add `np.random.seed(####)` at top to reproduce same random numbers. — Parfait, Sep 06 '18 at 18:00
Thanks again! Sorry for being unclear, what I meant is that running 'multiple_agg([0,1,2])' or 'multiple_agg([1,3,0], '2016-01-01', '2017-12-31')' or 'multiple_agg([2,3,1], start_date='2017-01-01', end_date='2018-12-31')' return the same multi-index (Customer, then Category, Sub-Category, then Product'. I was thinking that the multi-index order would change based on what orders the levels were passed in. — keg5038, Sep 06 '18 at 18:56
I see. It comes down to the `pd.concat` at end of function. Since *master* is the first item, it appends according to its index order. However, you can use [reorder_levels](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reorder_levels.html) using *mylevels* parameter. But you must now pass a list of 4 numbers where last one is never used in aggregation. See edit. — Parfait, Sep 06 '18 at 19:30
Just did some quick tests, but this is amazing!! Thank you so much for your help, it is greatly appreciated. I will report back if I come across anything strange, but this appears to work great. Thanks so much! — keg5038, Sep 06 '18 at 20:03

score 0 · Answer 2 · answered Sep 06 '18 at 17:15

I think you can do it using sum with the level parameter:

master = df.groupby(['Customer','Category','Sub-Category','Product',pd.Grouper(key='Date',freq='A')])['Units_Sold'].sum()\
.unstack()
s1 = master.sum(level=[0,1,2]).assign(Product='Total').set_index('Product',append=True)
s2 = master.sum(level=[0,1])

# Wanted to use assign method but because of the hyphen in the column name you can't.
# Also use the Z in front for sorting purposes
s2['Sub-Category'] = 'ZTotal'
s2['Product'] = ''
s2 = s2.set_index(['Sub-Category','Product'], append=True)

s3 = master.sum(level=[0])
s3['Category'] = 'Total'
s3['Sub-Category'] = ''
s3['Product'] = ''
s3 = s3.set_index(['Category','Sub-Category','Product'], append=True)

master_new = pd.concat([master,s1,s2,s3]).sort_index()
master_new

Output:

Date                                        2016-12-31  2017-12-31  2018-12-31
Customer   Category Sub-Category Product                                      
30XWmt1jm0 Group A  X            Product 1       651.0       341.0       453.0
                                 Product 2       267.0       445.0       117.0
                                 Product 3       186.0       280.0       352.0
                                 Total          1104.0      1066.0       922.0
                    Y            Product 1       426.0       417.0       670.0
                                 Product 2       362.0       210.0       380.0
                                 Product 3       232.0       290.0       430.0
                                 Total          1020.0       917.0      1480.0
                    Z            Product 1       196.0       212.0       703.0
                                 Product 2       277.0       340.0       579.0
                                 Product 3       416.0       392.0       259.0
                                 Total           889.0       944.0      1541.0
                    ZTotal                      3013.0      2927.0      3943.0
           Group B  X            Product 1       356.0       230.0       407.0
                                 Product 2       402.0       370.0       590.0
                                 Product 3       262.0       381.0       377.0
                                 Total          1020.0       981.0      1374.0
                    Y            Product 1       575.0       314.0       643.0
                                 Product 2       557.0       375.0       411.0
                                 Product 3       344.0       246.0       280.0
                                 Total          1476.0       935.0      1334.0
                    Z            Product 1       278.0       152.0       392.0
                                 Product 2       149.0       596.0       303.0
                                 Product 3       234.0       505.0       521.0
                                 Total           661.0      1253.0      1216.0
                    ZTotal                      3157.0      3169.0      3924.0
           Total                                6170.0      6096.0      7867.0
3U2anYOD6o Group A  X            Product 1       214.0       443.0       195.0
                                 Product 2       170.0       220.0       423.0
                                 Product 3       111.0       469.0       369.0
...                                                ...         ...         ...
somc22Y2Hi Group B  Z            Total           906.0      1063.0       680.0
                    ZTotal                      3070.0      3751.0      2736.0
           Total                                6435.0      7187.0      6474.0
zRZq6MSKuS Group A  X            Product 1       421.0       182.0       387.0
                                 Product 2       359.0       287.0       331.0
                                 Product 3       232.0       394.0       279.0
                                 Total          1012.0       863.0       997.0
                    Y            Product 1       245.0       366.0       111.0
                                 Product 2       377.0       148.0       239.0
                                 Product 3       372.0       219.0       310.0
                                 Total           994.0       733.0       660.0
                    Z            Product 1       280.0       363.0       354.0
                                 Product 2       384.0       604.0       178.0
                                 Product 3       219.0       462.0       366.0
                                 Total           883.0      1429.0       898.0
                    ZTotal                      2889.0      3025.0      2555.0
           Group B  X            Product 1       466.0       413.0       187.0
                                 Product 2       502.0       370.0       368.0
                                 Product 3       745.0       480.0       318.0
                                 Total          1713.0      1263.0       873.0
                    Y            Product 1       218.0       226.0       385.0
                                 Product 2       123.0       382.0       570.0
                                 Product 3       173.0       572.0       327.0
                                 Total           514.0      1180.0      1282.0
                    Z            Product 1       480.0       317.0       604.0
                                 Product 2       256.0       215.0       572.0
                                 Product 3       463.0        50.0       349.0
                                 Total          1199.0       582.0      1525.0
                    ZTotal                      3426.0      3025.0      3680.0
           Total                                6315.0      6050.0      6235.0

[675 rows x 3 columns]

Thanks, Scott. This is a great solution that I will use going forward for the creation of subtotals. I'm hoping to turn it into a function like solution that I can pass levels & dates to - any suggestions there? — keg5038, Sep 06 '18 at 17:43

Pythonic / Panda Way to Create Function to Groupby

2 Answers2

Linked