pandas: efficient way to make calculations within groups

Question

I am working with transactional data like this:

Purchase data        | Buyer | Product | Quantity
--------------------------------------------------
2013-01-01 13:00:00  | Carl  | A       | 5
2013-01-01 13:05:00  | Mark  | B       | 2
2013-01-01 20:00:00  | Carl  | A       | 5
2013-01-02 10:00:00  | Joe   | A       | 10
2013-01-02 14:00:00  | Carl  | A       | 5

Now I would like to group by the purchasing date, the buyer and product to calculate the following questions:

How long have the customers been in the department each day (time first purchase - time last purchase)? In case of only let us assume 1h.
Which is the daily top seller products until noon and until midnight?
How can I calculate intergroup statistics such as the average purchase quantity without using avg but using a self defined function ?

Update

Is there also a possibility to iterate over the columns within groups e.g. I want to calculate a column which displays the quantity differences between Carl and all other buyers. It would look like this
```
Date       | Buyer | Difference
-------------------------------
2013-01-01 | Carl  | 0
2013-01-01 | Mark  | -3
2013-01-01 | Carl  | 0
2013-01-01 | Joe   | 5
```
Moreover are there days on which no purchase occurred?

I would be very thankful for your help thanks

Andy

Hi Andy, you are right. I just accepted your answers to my other questions. Sorry I am new to stackoverflow. — Andy, May 17 '13 at 13:54

unutbu · Answer 1 · 2013-05-17T18:21:27.653

Given this setup:

import pandas as pd
import datetime as DT
df = pd.DataFrame({
    'Buyer': 'Carl Mark Carl Joe Joe Carl'.split(),
    'Product': list('ABAABA'),
    'Quantity': [5,2,5,10,1,5]
    }, index=[
        DT.datetime(2013,1,1,13,0),
        DT.datetime(2013,1,1,13,5),
        DT.datetime(2013,1,1,20,0),
        DT.datetime(2013,1,2,10,0),
        DT.datetime(2013,1,2,12,0),                                      
        DT.datetime(2013,1,2,14,0),
        ])

print(df)
#                     Buyer Product  Quantity
# 2013-01-01 13:00:00  Carl       A         5
# 2013-01-01 13:05:00  Mark       B         2
# 2013-01-01 20:00:00  Carl       A         5
# 2013-01-02 10:00:00   Joe       A        10
# 2013-01-02 12:00:00   Joe       B         1
# 2013-01-02 14:00:00  Carl       A         5

How long have the customers been in the department each day (time first purchase - time last purchase)? In case of only let us assume 1h.

def lingertime(df):
    dates = df.index.map(lambda d: d.date())
    def linger(grp):
        dates = grp.index
        x = (dates.max()-dates.min())
        return x or DT.timedelta(hours=1)
    return df.groupby([dates, 'Buyer']).apply(linger)

print(lingertime(df))
# date        Buyer
# 2013-01-01  Carl     7:00:00
#             Mark     1:00:00
# 2013-01-02  Carl     1:00:00
#             Joe      2:00:00

Which is the daily top seller products until noon and until midnight?

def product_quantity(df, from_hour, to_hour):
    df_timeslice = df.ix[
        df.index.indexer_between_time(
            DT.time(from_hour), DT.time(to_hour),
            include_start=True, include_end=False)]
    # print(df_timeslice)
    #                     Buyer Product  Quantity
    # 2013-01-02 10:00:00   Joe       A        10
    # 2013-01-02 12:00:00   Joe       B         1
    return df_timeslice.groupby('Product').sum().sort(['Quantity'], ascending=False)

print(product_quantity(df, 0, 12))
#          Quantity
# Product          
# A              10

print(product_quantity(df, 12, 0))
#          Quantity
# Product          
# A              15
# B               3

How can I calculate intergroup statistics such as the average purchase quantity without using avg but using a self defined function ?

def average_quantity_per_product(df):
    def myavg(grp):
        return grp['Quantity'].mean()
    return df.groupby('Product').apply(myavg)
print(average_quantity_per_product(df))
# Product
# A          6.25
# B          1.50

To compare one Buyer with other Buyers grouped by day:

def compare_buyers_with(df, name):
    def compare(grp):
        groups = grp.groupby('Buyer')
        total = groups['Quantity'].sum()
        return total-total.get(name, 0)
    dates = df.index.map(lambda d: d.date())
    return df.groupby([dates]).apply(compare)
print(compare_buyers_with(df, 'Carl'))
#             Buyer
# 2013-01-01  Carl     0
#             Mark    -8
# 2013-01-02  Carl     0
#             Joe      6
# Name: Quantity

To find days when a product has not been sold:

def days_when_not_sold(df, name):
    dates = df.index.map(lambda d: d.date())
    def not_in(grp):
        return not np.any(name == grp['Product'])
    sales = df.groupby([dates]).apply(not_in)
    return sales.index.values[sales]
print(days_when_not_sold(df, 'A'))
# []
print(days_when_not_sold(df, 'C'))
# [2013-01-01 2013-01-02]

Hi unutbu, thanks a lot. Great job! One more thing, is there also a possibility to iterate over the columns within groups e.g. I want to calculate a column which displays the quantity differences between Carl and all other buyers — Andy, May 17 '13 at 14:32
I've added some code which shows how one *could* iterate over the columns of a group, but I'm rather confused about what you want to do with it. If you show an example of what you want calculated and include the expected result, we may be able to suggest a better way to calculate it. — unutbu, May 17 '13 at 14:52
Andy: I'm not entirely confident I have the best method for tackling each of your questions. If you post them each as a separate question, you may get more responses, and then you'll get to pick the best of the litter. — unutbu, May 17 '13 at 18:25

pandas: efficient way to make calculations within groups

1 Answers1