
I have data like this in a PySpark DataFrame, spanning a few years. The week_start_dt is the date I kicked off a prediction, and the start_month values are the previous 12 months.

+--------------------+------------------+----------------------+----------------+
|     start_month    |     week_start_dt|           predictions|       actuals  |
+--------------------+------------------+----------------------+----------------+
|             2019-01|        2019-11-11|                    12|              11|
|             2018-12|        2019-11-11|                    13|              11|
|             2019-08|        2019-11-11|                     9|              11|
|             2019-11|        2019-11-11|                    12|              11|
|             2019-11|        2019-11-11|                  1970|            1440|
|             2019-11|        2019-11-11|                   478|             501|
+--------------------+------------------+----------------------+----------------+

I would like to calculate the RMSE with a groupby on start_month and week_start_dt. I think it would require a user-defined aggregate function, something along the lines of this pandas question: Python Dataframe: Calculating R^2 and RMSE Using Groupby on One Column
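
For each (start_month, week_start_dt) group, that means computing:

RMSE = sqrt( mean( (predictions - actuals)^2 ) )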

I use the following code just to get rolled-up sums of predictions and actuals for the groupby (f is pyspark.sql.functions):

import pyspark.sql.functions as f

df_startmonth_week = actuals_compare.groupby('start_month', 'week_start_dt').agg(f.sum('predictions'), f.sum('actuals'))

What do I change in my aggregation step to calculate the RMSE between predictions and actuals? Do I need a UDF to do this?

Here is an example of the end goal, which I made in Excel:

| week_start_dt | start_month | RMSE |
|---------------|-------------|------|
| 20-01-2020    | 2019-02     | 2345 |
| 20-01-2020    | 2019-03     | 2343 |
| 20-01-2020    | 2019-04     | 2341 |
| 20-01-2020    | 2019-05     | 2100 |
| 20-01-2020    | 2019-06     | 1234 |
sanjayr
  • I would go with a UDF - also, check out https://spark.apache.org/docs/1.6.3/mllib-evaluation-metrics.html#regression-model-evaluation – Chinny84 Apr 12 '20 at 00:14
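
For reference, the docs linked in that comment cover Spark's built-in regression metrics. A minimal sketch of that route using the DataFrame-based pyspark.ml.evaluation.RegressionEvaluator (assuming the column names from the question) would be:

import pyspark.sql.functions as f
from pyspark.ml.evaluation import RegressionEvaluator

# the evaluator expects double-typed columns, so cast first
df_double = (actuals_compare
             .withColumn('actuals', f.col('actuals').cast('double'))
             .withColumn('predictions', f.col('predictions').cast('double')))

evaluator = RegressionEvaluator(predictionCol='predictions',
                                labelCol='actuals',
                                metricName='rmse')
overall_rmse = evaluator.evaluate(df_double)  # a single RMSE for the whole DataFrame

Note this yields one RMSE over the entire DataFrame, not one per (start_month, week_start_dt) group, so a grouped aggregation is still needed for the end goal above.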

1 Answer


I don't see how this differs from the problem solved here, so I've adapted that solution to your slightly different variable names:

import pyspark.sql.functions as psf

def compute_RMSE(df, expected_col, actual_col):
    # square the per-row error, average it per group, then take the square root
    rmse = (df
            .withColumn("squarederror",
                        psf.pow(psf.col(actual_col) - psf.col(expected_col),
                                psf.lit(2)))
            .groupby('start_month', 'week_start_dt')
            .agg(psf.avg(psf.col("squarederror")).alias("mse"))
            .withColumn("rmse", psf.sqrt(psf.col("mse"))))
    return rmse

compute_RMSE(actuals_compare, "predictions", "actuals")
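
No UDF is needed here: pow, avg, and sqrt are all built-in column functions. If you prefer, the same math collapses into a single agg step (a variant of the above, not a different method):

df_rmse = (actuals_compare
           .groupby('start_month', 'week_start_dt')
           .agg(psf.sqrt(psf.avg(psf.pow(psf.col('actuals') - psf.col('predictions'),
                                         psf.lit(2)))).alias('rmse')))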

Tell me if I missed a slight difference in the problem

linog