
I have a function that calculates RMSE for the preds and actuals of an entire dataframe:

from pyspark.sql import functions as F

def calculate_rmse(df, actual_column, prediction_column):
    # squared error per row via a Python UDF, then sqrt of the mean
    RMSE = F.udf(lambda x, y: ((x - y) ** 2))
    df = df.withColumn(
        "RMSE", RMSE(F.col(actual_column), F.col(prediction_column))
    )
    rmse = df.select(F.avg("RMSE") ** 0.5).collect()
    rmse = rmse[0]["POWER(avg(RMSE), 0.5)"]
    return rmse

test = calculate_rmse(my_df, 'actuals', 'preds')

3690.4535

I would like to apply this to a groupby statement, but when I do, I get the following:

df_gb = my_df.groupby('start_month', 'start_week').agg(calculate_rmse(my_df, 'actuals', 'preds'))


Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/group.py", line 113, in agg
    assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
AssertionError: all exprs should be Column

Could someone point me in the correct direction? I am fairly new to Pyspark.

sanjayr
3 Answers


If you want to calculate RMSE by group, a slight adaptation of the solution I proposed to your question should work:

import pyspark.sql.functions as psf

def compute_RMSE(expected_col, actual_col):
    # old_df is the original DataFrame (my_df in the question)
    rmse = (old_df
            .withColumn("squarederror",
                        psf.pow(psf.col(actual_col) - psf.col(expected_col),
                                psf.lit(2)))
            .groupby('start_month', 'start_week')
            .agg(psf.avg(psf.col("squarederror")).alias("mse"))
            .withColumn("rmse", psf.sqrt(psf.col("mse"))))
    return rmse


compute_RMSE("col1", "col2")
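
Calling it on the question's data and displaying the per-group result would be something along these lines (a sketch only; the column and group names are assumed from the question, with `old_df` standing in for `my_df`):

# hypothetical usage with the question's column names
compute_RMSE("preds", "actuals").select("start_month", "start_week", "rmse").show()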
linog
  • I think this is super close, I think we need to take the average of the squared error to get the correct answer? – sanjayr Apr 12 '20 at 18:22
  • Nice - my answer had the steps decomposed, but you included it all in one expression, which is necessary for the groupby. – danielcahall Apr 12 '20 at 18:23
  • Yes, you're right, it's `avg`, not `sum`, that is required. I've edited accordingly. – linog Apr 12 '20 at 18:24

I don't think you need a UDF for this - you should be able to take the difference between the two columns (df.withColumn('difference', col('true') - col('predicted'))), square that column (df.withColumn('squared_difference', pow(col('difference'), lit(2).astype(IntegerType())))), and then average the squared column (df.select(avg('squared_difference').alias('rmse'))). Putting it all together with an example:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(0.0, 1.0),
                            (1.0, 2.0),
                            (3.0, 5.0),
                            (1.0, 8.0)], schema=['true', 'predicted'])

df = df.withColumn('difference', F.col('true') - F.col('predicted'))
df = df.withColumn('squared_difference', F.pow(F.col('difference'), F.lit(2).astype(IntegerType())))
rmse = df.select(F.avg(F.col('squared_difference')).alias('rmse'))

df.show()
rmse.show()

Output:

+----+---------+----------+------------------+
|true|predicted|difference|squared_difference|
+----+---------+----------+------------------+
| 0.0|      1.0|      -1.0|               1.0|
| 1.0|      2.0|      -1.0|               1.0|
| 3.0|      5.0|      -2.0|               4.0|
| 1.0|      8.0|      -7.0|              49.0|
+----+---------+----------+------------------+

+-----+
| rmse|
+-----+
|13.75|
+-----+

Hope this helps!

Edit

Sorry, I forgot to take the square root of the result - the last line becomes:

rmse = df.select(F.sqrt(F.avg(F.col('squared_difference'))).alias('rmse'))

and the output becomes:

+------------------+
|              rmse|
+------------------+
|3.7080992435478315|
+------------------+
danielcahall
  • RMSE can be computed in two lines of code using Spark ML when it comes to the entire dataset. But the OP wants to compute it **for each separate group** after `groupBy()`. – Hristo Iliev Apr 12 '20 at 20:43
  • Yeah I just realized that - I think linog's answer is the correct one. – danielcahall Apr 12 '20 at 20:45
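
For reference, the Spark ML route mentioned in the comment above might look roughly like this (a minimal sketch, assuming the question's `my_df` with numeric `actuals` and `preds` columns); note that it yields a single RMSE for the whole dataset, not one per group:

from pyspark.ml.evaluation import RegressionEvaluator

# whole-dataset RMSE via Spark ML; column names are assumed from the question
evaluator = RegressionEvaluator(labelCol="actuals", predictionCol="preds", metricName="rmse")
rmse = evaluator.evaluate(my_df)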
# Another UDF-free option for the whole-DataFrame RMSE: square the residuals,
# sum them with agg, divide by the row count, then take the square root.
newdf = newdf.withColumn("delx2", (newdf['true'] - newdf['pred']) ** 2)
rmseval = (newdf.agg({'delx2': 'sum'}).collect()[0][0] / newdf.count()) ** 0.5
print(f'RMSE value is {round(rmseval, 3)}')