
I'm trying to calculate a moving average on a bitemporal dataset. The dataset consists of a data date and an effective date (the date at which the data became available). The data for a given data date can be restated several times in the future (same data date, but a different effective date). I need to calculate a moving average over the past 4 quarters, using the data that is valid as of the effective date of the row being calculated.

The dataset looks like this:

id datadate effdate value
1 2005-03-31 2005-04-15 10
1 2005-03-31 2005-05-30 11
1 2005-06-30 2005-07-15 9
1 2005-06-30 2005-08-20 9.5
1 2005-06-30 2005-10-15 9.6
1 2005-09-30 2005-10-15 10.5
1 2005-09-30 2005-11-10 11
1 2005-09-30 2006-02-20 10.75
1 2005-12-31 2006-02-13 12
1 2005-12-31 2006-02-20 11.5
1 2005-12-31 2006-05-10 11
1 2006-03-31 2006-04-20 8
1 2006-03-31 2006-05-10 8.25

The result should be (the first column is the row index):

   id datadate effdate value MAvg
0 1 2005-03-31 2005-04-15 10 10
1 1 2005-03-31 2005-05-30 11 11
2 1 2005-06-30 2005-07-15 9 10
3 1 2005-06-30 2005-08-20 9.5 10.25
4 1 2005-06-30 2005-10-15 9.6 10.30
5 1 2005-09-30 2005-10-15 10.5 10.37
6 1 2005-09-30 2005-11-10 11 10.53
7 1 2005-09-30 2006-02-20 10.75 10.45
8 1 2005-12-31 2006-02-13 12 10.9
9 1 2005-12-31 2006-02-20 11.5 10.71
10 1 2005-12-31 2006-05-10 11 10.59
11 1 2006-03-31 2006-04-20 8 9.96
12 1 2006-03-31 2006-05-10 8.25 9.9
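
For example, the MAvg on row 12 (effdate 2006-05-10) is the mean of the most recent restatements of the four quarters 2005-06-30 through 2006-03-31 whose effdate is on or before 2006-05-10, i.e. rows 12, 10, 7 and 4: (8.25 + 11 + 10.75 + 9.6) / 4 = 9.9.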

I'm doing this in Python using pandas. My approach is to join the dataframe with itself on id, restrict each row to its previous 4 quarters, and compute new effdates for each period from the effdates of those quarters; then I join once more on id, datadate and effdate and take the average.
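
Here df1 and df2 are copies of the base frame with an extra thrudate column, the end of each restatement's validity. One way to build it (a sketch; I'm assuming the latest restatement stays valid indefinitely, using a far-future sentinel date):

df = df.sort_values(["id", "datadate", "effdate"])
nxt = df.groupby(["id", "datadate"])["effdate"].shift(-1)
df["thrudate"] = (nxt - pd.Timedelta(days=1)).fillna(pd.Timestamp("2262-04-11"))
df1 = df.copy()
df2 = df.copy()

With df1 and df2 in hand: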

keys["id"]
calc_df = df1.merge(df2, on=keys, how='left')
calc_df = calc_df.loc[
            (calc_df["datadate_x"] >= calc_df["datadate_y"])
            & (calc_df["datadate_y"] >= calc_df["datadate_x"] - pd.tseries.offsets.MonthEnd(n=9))
            & (calc_df["effdate_x"] <= calc_df["thrudate_y"])
            & (calc_df["thrudate_x"] >= calc_df["effdate_y"])
        ]
calc_df = calc_df.drop_duplicates().reset_index(drop=True)
grp_keys = keys + ["datadate_x"]
# a combined row becomes visible once both quarters are published, and
# expires as soon as either quarter is restated
calc_df["effdate"] = calc_df[["effdate_x", "effdate_y"]].max(axis=1)
calc_df["thrudate"] = calc_df[["thrudate_x", "thrudate_y"]].min(axis=1)
calc_df = calc_df.sort_values(grp_keys + ["effdate"]).drop_duplicates(
    subset=grp_keys + ["effdate"], keep="first"
)
# keep just the candidate effdates per (id, datadate_x); the values are
# re-attached by the second join below
calc_df = calc_df[["id", "datadate_x", "effdate", "thrudate"]]

# second join: for every candidate (id, datadate_x, effdate), pick up the
# restatement of each window quarter that is valid on that effdate
calc_df = calc_df.merge(df1, on=["id"], how="left")
calc_df = calc_df.loc[
    (calc_df["datadate_x"] >= calc_df["datadate"])
    & (calc_df["datadate"] >= calc_df["datadate_x"] - pd.tseries.offsets.MonthEnd(n=9))
    & (calc_df["effdate_x"] <= calc_df["thrudate_y"])
    & (calc_df["thrudate_x"] >= calc_df["effdate_y"])
]

# "value" here is df1's value for the matched quarter
calc_df["MAvg"] = calc_df.groupby(["id", "datadate_x", "effdate_x"])["value"].transform(
    lambda s: s.mean(skipna=False)
)

This works, but it's very slow when I run it on the full dataset, which has around 2000 different ids and datadates from 2000 to the most recent quarter (around 500K rows), and I have to calculate moving averages on several fields for different windows. So I want to see if there is a more efficient way of doing this.

Satish

1 Answer


It's unclear what would constitute an answer to this question, since you've only asked for something "better" and you've not indicated which axis you'd like this improvement to be on.

So I'll provide you something that is better in terms of readability. This might help others find improvements in other directions:

Given the CSV `data.csv`:

datadate,effdate,value
2005-03-31,2005-04-15,10
2005-03-31,2005-05-30,11
2005-06-30,2005-07-15,9
2005-06-30,2005-08-20,9.5
2005-06-30,2005-10-15,9.6
2005-09-30,2005-10-15,10.5
2005-09-30,2005-11-10,11
2005-09-30,2006-02-20,10.75
2005-12-31,2006-02-13,12
2005-12-31,2006-02-20,11.5
2005-12-31,2006-05-10,11
2006-03-31,2006-04-20,8
2006-03-31,2006-05-10,8.25

You can achieve the same outcome (on pandas 1.1+, where expanding windows are iterable) by doing the following:

import pandas as pd

data = pd.read_csv('data.csv').sort_values(by=["effdate"])


def moving_avg(df, quarters):
    # keep only the latest restatement of each quarter seen so far,
    # then average over the trailing `quarters` quarters
    df = df[
        df.index <= df.last_valid_index()
    ].drop_duplicates(subset="datadate", keep="last").tail(quarters)
    return df["value"].sum() / df["value"].count()


print(data.assign(MAvg=[moving_avg(d, 4) for d in data.expanding()]).sort_index())

This provides:

      datadate     effdate  value       MAvg
0   2005-03-31  2005-04-15  10.00  10.000000
1   2005-03-31  2005-05-30  11.00  11.000000
2   2005-06-30  2005-07-15   9.00  10.000000
3   2005-06-30  2005-08-20   9.50  10.250000
4   2005-06-30  2005-10-15   9.60  10.300000
5   2005-09-30  2005-10-15  10.50  10.366667
6   2005-09-30  2005-11-10  11.00  10.533333
7   2005-09-30  2006-02-20  10.75  10.450000
8   2005-12-31  2006-02-13  12.00  10.900000
9   2005-12-31  2006-02-20  11.50  10.712500
10  2005-12-31  2006-05-10  11.00  10.587500
11  2006-03-31  2006-04-20   8.00   9.962500
12  2006-03-31  2006-05-10   8.25   9.900000
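
On the performance axis: the expanding iteration above re-materialises every prefix of the frame, so the work grows roughly quadratically with the rows per id. Here is a sketch of a join-free variant built on pd.merge_asof, one backward-looking pass per lag (assumptions: datadate always falls on a quarter end, and I add an id column by hand since the sample CSV has none; only checked against the sample data):

import pandas as pd

df = pd.read_csv("data.csv", parse_dates=["datadate", "effdate"])
df["id"] = 1  # the sample csv has no id column

base = df.sort_values("effdate").reset_index(drop=True)

lagged = []
for k in range(4):  # the current quarter plus the previous three
    right = df.copy()
    if k:
        # shift each quarter forward k quarters so it lines up with the
        # datadate of the rows whose trailing window it falls into
        right["datadate"] = right["datadate"] + pd.tseries.offsets.MonthEnd(3 * k)
    merged = pd.merge_asof(
        base,
        right.sort_values("effdate"),
        on="effdate",           # latest restatement with effdate <= this row's effdate
        by=["id", "datadate"],  # matched within the same id and (shifted) quarter
        direction="backward",
        suffixes=("", f"_lag{k}"),
    )
    lagged.append(merged[f"value_lag{k}"])

# quarters with no restatement published yet come back NaN and are skipped
print(base.assign(MAvg=pd.concat(lagged, axis=1).mean(axis=1))
          .sort_values(["id", "datadate", "effdate"]))

Each pass is a sort-merge rather than a cross join, and the by=["id", "datadate"] grouping means the same code runs unchanged over all 2000 ids; extra fields are just extra value columns pulled through the same merges.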
foxyblue
  • I updated the dataset for more clarity and to show why I'm looking for a different solution. As you can see, taking the last row for a datadate does not work. It has to be the most recent effdate that is valid for the current row's effdate. I hope that clarifies it. Thanks. – Satish Jan 28 '21 at 16:21
  • @Satish I think I'm confused: if `effdate` is sorted from oldest to newest, the last one will be the most recent. If you want effdate to determine the date at which you look back, first sort by `effdate`, then `datadate`. I'm slightly confused how you got your results. – foxyblue Jan 28 '21 at 17:28
  • For each row you have to get the average of the values of the 4 quarters where the effdate is closest to but not greater than the effdate of the current row. For row 12 the MAvg would be the mean of rows 12, 10 (not 11), 8, 4. – Satish Jan 28 '21 at 18:09
  • Why does MAvg of 12 not include row 1? `effdate` of 1 is before 12. – foxyblue Jan 28 '21 at 18:22
  • It's a rolling mean of 4 quarters (the current quarter and the past 3 quarters) – Satish Jan 28 '21 at 18:23
  • Ahh ok, I understand you now! – foxyblue Jan 28 '21 at 18:23
  • @Satish I have updated my answer, I still don't think I have it 100% – foxyblue Jan 28 '21 at 19:06
  • Perhaps you can explain 10.43 for row 7. I believe it to be 7,4,1 (10.75,9.6,11)/3=10.45 – foxyblue Jan 28 '21 at 19:18
  • Do you want to double check your row 8 too? :) I get 8 = {8, 6, 4, 1} (12,11,9.6,11)/4 = 10.9 – foxyblue Jan 28 '21 at 19:47