0

I am trying to convert my 15ys worth of daily data into weekly by taking the mean, diff and count of certain features. I tried using .resample but I was not sure if that is the most efficient way.

My sample data:

Date,Product,New Quantity,Price,Refund Flag 8/16/1994,abc,10,0.5, 8/17/1994,abc,11,0.9,1 8/18/1994,abc,15,0.6, 8/19/1994,abc,19,0.4, 8/22/1994,abc,22,0.2,1 8/23/1994,abc,19,0.1, 8/16/1994,xyz,16,0.5,1 8/17/1994,xyz,10,0.9,1 8/18/1994,xyz,12,0.6,1 8/19/1994,xyz,19,0.4, 8/22/1994,xyz,26,0.2,1 8/23/1994,xyz,30,0.1, 8/16/1994,pqr,0,0, 8/17/1994,pqr,0,0, 8/18/1994,pqr,1,1, 8/19/1994,pqr,2,0.6, 8/22/1994,pqr,9,0.1, 8/23/1994,pqr,12,0.2,

This is the output I am looking for: Date,Product,Net_Quantity_diff,Price_avg,Refund 8/16/1994,abc,9,0.6,1 8/22/1994,abc,-3,0.15,0 8/16/1994,xyz,3,0.6,3 8/22/1994,xyz,4,0.15,1 8/16/1994,pqr,2,0.4,0 8/22/1994,pqr,3,0.15,0

Ksh
  • 85
  • 4

1 Answers1

0

I think the pandas resample method is indeed ideal for this. You can pass a dictionary to the agg method, defining which aggregation function to use for each column. For example:

import numpy as np
import pandas as pd

df = pd.read_csv('sales.txt')  # your sample data
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index(df['Date'])
del df['Date']
df['Refund Flag'] = df['Refund Flag'].fillna(0).astype(bool)

def span(s):
    return np.max(s) - np.min(s)

df_weekly = df.resample('w').agg({'New Quantity': span,
                                  'Price': np.mean,
                                  'Refund Flag': np.sum})
df_weekly
            New Quantity  Price     Refund Flag
Date            
1994-08-21  19            0.533333  4
1994-08-28  21            0.150000  2
Arne
  • 9,990
  • 2
  • 18
  • 28
  • I am trying to do something similar using resample but when I run it on a groupby df, it takes a long time. I am trying to run it on 40 columns. – Ksh Oct 01 '21 at 20:12
  • How does `groupby` enter into this? Anyway, assuming the data are sorted by date, you could try dividing the data into batches (making sure not to cut within weeks), resampling each batch, and then combining the results. – Arne Oct 01 '21 at 21:04