7

I quite often write a function to return different dataframes based on the parameters I enter. Here's an example dataframe:

import numpy as np
import pandas as pd

np.random.seed(1111)
df = pd.DataFrame({
    'Category': np.random.choice(['Group A', 'Group B', 'Group C', 'Group D'], 10000),
    'Sub-Category': np.random.choice(['X', 'Y', 'Z'], 10000),
    'Sub-Category-2': np.random.choice(['G', 'F', 'I'], 10000),
    'Product': np.random.choice(['Product 1', 'Product 2', 'Product 3'], 10000),
    'Units_Sold': np.random.randint(1, 100, size=10000),
    'Dollars_Sold': np.random.randint(100, 1000, size=10000),
    'Customer': np.random.choice(pd.util.testing.rands_array(10, 25, dtype='str'), 10000),
    'Date': np.random.choice(pd.date_range('1/1/2016', '12/31/2018', freq='M'), 10000)})

I then created a function to perform sub-totals for me like this:

def some_fun(DF1, agg_column, myList=[], *args):
    # Concatenate one grouped sum per sub-total level, then pivot the Date level to columns
    y = pd.concat([
        DF1.assign(**{x: '[Total]' for x in myList[i:]})
           .groupby(myList)
           .agg(sumz=(agg_column, 'sum'))
        for i in range(1, len(myList) + 1)
    ]).sort_index().unstack(0)
    return y

I then write out lists that I'll pass as arguments to the function:

list_one = [pd.Grouper(key='Date',freq='A'),'Category','Product']
list_two = [pd.Grouper(key='Date',freq='A'),'Category','Sub-Category','Sub-Category-2']
list_three = [pd.Grouper(key='Date',freq='A'),'Sub-Category','Product']

I then have to run each list through my function, creating a new dataframe for each:

df1 = some_fun(df,'Units_Sold',list_one)
df2 = some_fun(df,'Dollars_Sold',list_two)
df3 = some_fun(df,'Units_Sold',list_three)

I then use a function to write each of these dataframes to an Excel worksheet. This is just an example - I perform this same exercise 10+ times.
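
The writer function itself isn't shown in the question; a minimal sketch of what that step might look like, using the same pandas-0.25-era ExcelWriter API that appears later in this thread (the helper name and sheet names here are assumptions, not the asker's actual code):

# Hypothetical helper for the Excel-writing step described above; the
# function name and sheet names are illustrative assumptions.
def write_frames(frames, path='subtotals.xlsx'):
    # frames: dict mapping worksheet name -> dataframe
    writer = pd.ExcelWriter(path)
    for sheet_name, frame in frames.items():
        frame.to_excel(writer, sheet_name=sheet_name)
    writer.save()

write_frames({'units_list_one': df1,
              'dollars_list_two': df2,
              'units_list_three': df3})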

My question - is there a better way to perform this task than to write out df1, df2, df3 with the function information applied? Should I be looking at using a dictionary or some other data type to do this more pythonically with a function?

keg5038
  • are all the lists for the individual data frames unique? – Umar.H Jan 12 '20 at 03:51
  • Hi! There could be duplicate lists passed to the function, but I’d use a different agg_column in the function call. If it’s easier, I’d also be fine to change the function to put the agg_column inside the list. Thanks! – keg5038 Jan 12 '20 at 16:52
  • @keg5038 Does your example code work on your end? When I try to go through these steps, `df1 = some_fun(df,'Units_Sold',list_one)` only returns an error `TypeError: aggregate() missing 1 required positional argument: 'arg'` after the step `DF1.assign(**{x:'[Total]' for x in myList[i:]}).groupby(myList).agg(sumz = (agg_column,'sum')) for i in range(1,len(myList)+1)]).sort_index().unstack(0)` – vestland Jan 15 '20 at 16:15
  • @vestland I should have been more explicit - this code is using Pandas 0.25 which allows a Groupby aggregation with relabeling (https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#groupby-aggregation-with-relabeling). For older versions of Pandas, you could alter the function to have .agg({agg_column:'sum'}) instead of .agg(sumz = (agg_column,'sum')); see the sketch after these comments – keg5038 Jan 15 '20 at 19:16
  • It's quite hard to say, but my guess is to use a nested dictionary for both the `agg_column` and the `pd.Grouper` calls. This should get you very close. Other than that, maybe a few classes or well-defined functions? It depends on how many variations of the `lists` there are as well – Umar.H Jan 19 '20 at 17:19
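
Following up on the pandas-version comment above, here is a minimal sketch of the pre-0.25 variant of the function (the name `some_fun_legacy` is made up for illustration); only the `.agg` call changes, and with the dict form the resulting column is named after `agg_column` rather than `sumz`:

# Sketch of the pre-pandas-0.25 variant mentioned in the comments above;
# only the .agg call differs from the original function.
def some_fun_legacy(DF1, agg_column, myList=[], *args):
    y = pd.concat([
        DF1.assign(**{x: '[Total]' for x in myList[i:]})
           .groupby(myList)
           .agg({agg_column: 'sum'})
        for i in range(1, len(myList) + 1)
    ]).sort_index().unstack(0)
    return y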

2 Answers

3

A dictionary would be my first choice:

variations = [('Units_Sold', list_one), ('Dollars_Sold', list_two),
              ..., ('Title', some_list)]

df_variations = {}

for i, (name, data) in enumerate(variations):
    df_variations[i] = some_fun(df, name, data)

You might further consider setting the keys to unique, descriptive titles for the variations rather than something like 'Units_Sold', which isn't unique in your case.
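
For example, a minimal sketch of that idea, reusing the lists from the question (the title strings below are made-up illustrations, not fixed names):

# Key each result by a descriptive title instead of a position index;
# the titles are illustrative assumptions.
variations = {
    'units_by_category_product': ('Units_Sold', list_one),
    'dollars_by_sub_categories': ('Dollars_Sold', list_two),
    'units_by_subcategory_product': ('Units_Sold', list_three),
}

df_variations = {title: some_fun(df, name, data)
                 for title, (name, data) in variations.items()}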

1

IIUC,

As Thomas has suggested, we can use a dictionary to drive your data, but with some minor modifications to your function we can have the dictionary hold all the required inputs and then pass them through to your function.

The idea is to store two kinds of entries: the lists of columns, and the arguments to your pd.Grouper call.

data_dict = {
    "Units_Sold": {"key": "Date", "freq": "A"},
    "Dollars_Sold": {"key": "Date", "freq": "A"},
    "col_list_1": ["Category", "Product"],
    "col_list_2": ["Category", "Sub-Category", "Sub-Category-2"],
    "col_list_3": ["Sub-Category", "Product"],
}

def some_fun(dataframe, agg_col, dictionary, column_list, *args):
    # Look up the pd.Grouper arguments for this aggregation column,
    # then build the full grouping list from the named column list.
    key = dictionary[agg_col]["key"]
    frequency = dictionary[agg_col]["freq"]
    myList = [pd.Grouper(key=key, freq=frequency), *dictionary[column_list]]

    y = (
        pd.concat(
            [
                dataframe.assign(**{x: "[Total]" for x in myList[i:]})
                .groupby(myList)
                .agg(sumz=(agg_col, "sum"))
                for i in range(1, len(myList) + 1)
            ]
        )
        .sort_index()
        .unstack(0)
    )
    return y

Test.

df1 = some_fun(df,'Units_Sold',data_dict,'col_list_3')
print(df1)
                                 sumz                      
Date                   2016-12-31 2017-12-31 2018-12-31
Sub-Category Product                                   
X            Product 1      18308      17839      18776
             Product 2      18067      19309      18077
             Product 3      17943      19121      17675
             [Total]        54318      56269      54528
Y            Product 1      20699      18593      18103
             Product 2      18642      19712      17122
             Product 3      17701      19263      20123
             [Total]        57042      57568      55348
Z            Product 1      19077      17401      19138
             Product 2      17207      21434      18817
             Product 3      18405      17300      17462
             [Total]        54689      56135      55417
[Total]      [Total]       166049     169972     165293

As you want to automate writing the 10+ worksheets, we can again do that with a dictionary, this time mapping each aggregation column to its column lists:

matches = {'Units_Sold': ['col_list_1','col_list_3'],
          'Dollars_Sold' : ['col_list_2']}

Then a simple for loop writes everything to a single Excel workbook, one worksheet per combination; change this to match your required behavior.

writer = pd.ExcelWriter('finished_excel_file.xlsx')
for key, value in matches.items():
    for items in value:
        # one worksheet per (aggregation column, column list) combination
        dataframe = some_fun(df, key, data_dict, items)
        dataframe.to_excel(writer, f'{key}_{items}')
writer.save()
Umar.H
  • thank you so much! This is brilliant & I'm sorry for the delay in responding - I needed the opportunity to test it out as it was over my head initially. So far it has worked great & I will continue to test it. I am piecing together the logic on it but sincerely appreciate your thoughtful response! Thanks! – keg5038 Jan 23 '20 at 03:44
  • no problem, if there is a specific part where you need help I'll be happy to oblige @keg5038 – Umar.H Jan 23 '20 at 08:53