1

I am working on a transactions dataset and I need to figure out the average time between purchases for each client.

I managed to get the diff between the latest date and the earliest (in months) and divide by the total purchases (NumPurchases). But I am not sure of this approach as it does not take into consideration the fact that not every only bought on multiple occasions.

Imagine the following dataset how would you extract the average time between purchases.

    CustomerID  EarliestSale    LatestSale      NumPurchases    
0   1             2017-01-05    2017-12-23                11    
1   10            2017-06-20    2017-11-17                 5    
2   100           2017-05-10    2017-12-19                 2    
3   1000          2017-02-19    2017-12-30                 9    
4   1001          2017-02-07    2017-11-18                 7    

Apologies for the rookie question in advance and thanks StackOverflow community :).

Youssef Razak
  • 365
  • 4
  • 11
  • I don't this is an implementation question. Like, how do you do this in `pandas` etc. It sounds more like a data limitation. If you only have the `EarliestSale` and the `LatestSale` for a set of observations then you can't work out the average. Do you have the actual purchase data to work with, rather than the aggregated data? You could get the average from the purchase data and at it as a field in the customer data. – robertwest Nov 11 '20 at 14:57
  • Can't you simply make an exception for all customers with < 2 sales? Also, divide by (number of purchases - 1). Otherwise the time will be half for 2 sales – hasleron Nov 11 '20 at 14:59
  • @Robert that table is the result of aggregation on a groupby (custumer_id). The transaction table has individual records for sales – Youssef Razak Nov 11 '20 at 15:06
  • Ok, you should get your average values from the dataset before the group by. Calculate the diff on that table. Then add the average of the diff in the groupby. I can be more specific if you share the original dataset. Ideally in code we can run your code. Try to follow these guidelines to make it easier for us to help you: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – robertwest Nov 11 '20 at 15:10
  • df = pd.DataFrame({'CustomerId': ['001', '001', '001'], 'SaleDate': ['2017-01-10', '2017-04-10', '2017-08-10'], 'Quantity': ['5', '1', '1']}) – Youssef Razak Nov 11 '20 at 15:17
  • I think thats it. one customer multiple purchases on different dates. I need to figure out the average time between transactions – Youssef Razak Nov 11 '20 at 15:18

1 Answers1

4

Given your revised question and initial dataset (I've revised your dataset slightly to include two customers):

df = pd.DataFrame({'CustomerId': ['001', '001', '002', '002'], 
                   'SaleDate': ['2017-01-10', '2017-04-10', '2017-08-10', '2017-09-10'], 
                   'Quantity': [5, 1, 1, 6]})


You can easily include the average time between transactions (in days) in your group by with the following code:

NOTE: This will only work if you dataset is ordered by CustomerID then SaleDate.

import pandas as pd
df = pd.DataFrame({'CustomerId': ['001', '001', '002', '002'], 
                   'SaleDate': ['2017-01-10', '2017-04-10', '2017-08-10', '2017-09-10'], 
                   'Quantity': ['5', '1', '1', '6']})

# convert the string date to a datetime 
df['SaleDate'] = pd.to_datetime(df.SaleDate)

# sort the dataset
df = df.sort_values(['CustomerId', 'SaleDate'])

# calculate the difference between each date in days 
# (the .shift method will offset the rows, play around with this to understand how it works
# - We apply this to every customer using a groupby 
df2 = df.groupby("CustomerId").apply(
    lambda df: (df.SaleDate - df.SaleDate.shift(1)).dt.days).reset_index()
df2 = df2.rename(columns={'SaleDate': 'time-between-sales'})
df2.index = df2.level_1

# then join the result back on to the original dataframe
df = df.join(df2['time-between-sales'])

# add the mean time to your groupby 
grouped = df.groupby("CustomerId").agg({
    "SaleDate": ["min", "max"], 
    "Quantity": "sum", 
    "time-between-sales": "mean"})

# rename columns per your original specification 
grouped.columns = grouped.columns.get_level_values(0) + grouped.columns.get_level_values(1) 
grouped = grouped.rename(columns={
    'SaleDatemin': 'EarliestSale',
    'SaleDatemax': 'LatestSale',
    'Quantitysum': 'NumPurchases',
    'time-between-salesmean': 'avgTimeBetweenPurchases'})
print(grouped)

           EarliestSale LatestSale NumPurchases  avgTimeBetweenPurchases
CustomerId
001          2017-01-10 2017-04-10            6                     90.0
002          2017-08-10 2017-09-10            7                     31.0
robertwest
  • 904
  • 7
  • 13