I have data in a PySpark DataFrame like the one below, spanning the last few years. The week_start_dt is the date I kicked off a prediction, and the start_month values are the previous 12 months.
+-----------+-------------+-----------+-------+
|start_month|week_start_dt|predictions|actuals|
+-----------+-------------+-----------+-------+
|    2019-01|   2019-11-11|         12|     11|
|    2018-12|   2019-11-11|         13|     11|
|    2019-08|   2019-11-11|          9|     11|
|    2019-11|   2019-11-11|         12|     11|
|    2019-11|   2019-11-11|       1970|   1440|
|    2019-11|   2019-11-11|        478|    501|
+-----------+-------------+-----------+-------+
I would like to calculate the RMSE with a groupby on start_month and week_start_dt. I think it would require a user-defined aggregate function, something along the lines of this in pandas: Python Dataframe: Calculating R^2 and RMSE Using Groupby on One Column.
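Roughly, the pandas pattern I'm trying to translate looks like this (a sketch, untested; pdf here stands for a hypothetical pandas copy of my data):

import numpy as np

# Per-group RMSE via groupby().apply(): sqrt of the mean squared error
rmse_pdf = (
    pdf.groupby(['start_month', 'week_start_dt'])
       .apply(lambda g: np.sqrt(((g['predictions'] - g['actuals']) ** 2).mean()))
       .reset_index(name='RMSE')
)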
I currently use the following code just to get rolled-up totals of predictions and actuals for the groupby:

import pyspark.sql.functions as f

df_startmonth_week = actuals_compare.groupby('start_month', 'week_start_dt').agg(f.sum('predictions'), f.sum('actuals'))
What do I change in my aggregation step to calculate the RMSE between predictions and actuals? Do I need a UDF to do this?
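My best guess so far is that the built-in column functions might already be enough, with no UDF, by expressing RMSE as the square root of the mean squared difference (an untested sketch):

import pyspark.sql.functions as f

# RMSE per group = sqrt(mean((predictions - actuals)^2)), using only built-in aggregates
df_rmse = (
    actuals_compare
    .groupby('start_month', 'week_start_dt')
    .agg(f.sqrt(f.avg((f.col('predictions') - f.col('actuals')) ** 2)).alias('RMSE'))
)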
Here is an example of the end goal, which I mocked up in Excel:
| week_start_dt | start_month | RMSE |
|---------------|-------------|------|
| 20-01-2020 | 2019-02 | 2345 |
| 20-01-2020 | 2019-03 | 2343 |
| 20-01-2020 | 2019-04 | 2341 |
| 20-01-2020 | 2019-05 | 2100 |
| 20-01-2020 | 2019-06 | 1234 |