1

I have a dataset like this (example purpose)

df = pd.DataFrame({
    'Store' : [100, 100, 100, 100, 101, 101, 101, 101],
    'Product' : [5, 3, 10, 1, 3, 11, 2, 5],
    'Category' : ['A', 'B', 'C', 'A', 'B', 'A', 'C', 'A'],
    'Sales' : [100, 235, 120, 56, 789, 230, 300, 35]
})

So it is like

Store   Product Category    Sales
100      5       A           100
100      3       B           235
100      10      C           120
100      1       A           56
101      3       B           789
101      11      A           230
101      2       C           300
101      5       A           35

Each store has some products and each product has some category. I need to find out the total sales for each store and percentage sales for each category in each stores. So result has to be something like this:

         total_Sales    Category_A  Category_B  Category_C
Store               
100       511            30.528376   45.988258   23.483366
101       1354           19.571640   58.271787   22.156573

(The category columns are in %)

Currently I am doing it like this:

df1 = df.groupby(['Store']).apply(lambda x: x['Sales'].sum())
df1 = df1.to_frame()
df1 = df1.rename(columns={0 : 'Sales'})

def category_util(x, col, cat):
    total_sales = x['Sales'].sum()
    cat_sales = x[x[col] == cat]['Sales'].sum()
    if cat_sales == 0:
        return 0
    else:
        return cat_sales*100/total_sales
df1['Category_A'] = df.groupby(['Store']).apply(lambda x: category_util(x, 'Category', 'A'))
df1['Category_B'] = df.groupby(['Store']).apply(lambda x: category_util(x, 'Category', 'B'))
df1['Category_C'] = df.groupby(['Store']).apply(lambda x: category_util(x, 'Category', 'C'))

df1 is the desired output. It is working fine, but every apply function is sorting the grouping columns again and again and for a big dataset, it is highly time consuming. I want to do this in one function call. I tried something like:

df.groupby(['Store']).agg([lambda x: category_util(x, 'Category', 'A'),
                          lambda x: category_util(x, 'Category', 'B'),
                          lambda x: category_util(x, 'Category', 'C')])

But it fails with a KeyError for 'Sales`

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()

KeyError: 'Sales'

Any solution for this? Is there any way to use apply function with an array of lambda functions and calculate all the columns in one go? If it is not possible with apply, is it possible to do with agg? It will really save me a lot of time. Thank you in advance.

Ricky
  • 635
  • 2
  • 5
  • 20

3 Answers3

2

you can use pivot_table and unstack

table = df.pivot_table(index=['Store', 'Category'], values=['Sales'], aggfunc='sum')#.unstack().add_prefix('Category_')
t_sales = table.sum(level=0)
table=table.div(table.sum(level=0)).mul(100).unstack().add_prefix('Category_')
table.assign(total_sales=t_sales).reset_index()

    Store   Category_Sales                          total_sales
Category    Category_A  Category_B  Category_C  
0   100     30.528376   45.988258   23.483366       511
1   101     19.571640   58.271787   22.156573       1354

Pygirl
  • 12,969
  • 5
  • 30
  • 43
2

We can use groupby with unstack. Then we divide the sum over the axis=1:

dfn = df.groupby(['Store', 'Category'])['Sales'].sum().unstack(level=1)
total_sales = dfn.sum(axis=1)
dfn = (
    dfn.div(total_sales, axis=0)
    .mul(100)
    .add_prefix("Category_")
    .assign(total_sales=total_sales)
).rename_axis(columns=None)
       Category_A  Category_B  Category_C  total_sales
Store                                                 
100     30.528376   45.988258   23.483366          511
101     19.571640   58.271787   22.156573         1354
Erfan
  • 40,971
  • 8
  • 66
  • 78
2

We can create two groupby objects (relatively cheap operations), and pipe a function that returns a dataframe, containing the total sum, and percentages:

group1 = df.groupby('Store')

group2 = df.groupby(['Store', 'Category'])

(df.assign(total_sales = group1.Sales.transform('sum'))
.groupby(['Store','Category'])
.pipe(lambda df: pd.DataFrame({"res" :df.Sales.sum()
                                        .div(df.total_sales.max())
                                        .mul(100), 
                               "total_sales": df.total_sales.max()}))
.set_index('total_sales', append = True)
.unstack('Category')
.droplevel(0, axis=1)
.add_prefix('Category_')
.rename_axis(columns=None)
.reset_index()
)


   Store  total_sales  Category_A  Category_B  Category_C
0    100          511   30.528376   45.988258   23.483366
1    101         1354   19.571640   58.271787   22.156573
sammywemmy
  • 27,093
  • 4
  • 17
  • 31