I'm able to calculate the Median Absolute Error with this function:
from pyspark.sql import Window
from pyspark.sql import functions as f

def compute_Median_Abs_Err(df, expected_col, actual_col):
    grp_window = Window.partitionBy('grp')
    magic_percentile = f.expr('percentile_approx(abserror, 0.5)')
    med_abs_err = df.withColumn(
        "abserror", f.abs(f.col(actual_col) - f.col(expected_col))
    ).groupby('start_month', 'start_dt'
    ).agg(magic_percentile.alias("med_abs_error"))
    return med_abs_err
Which can be calculated with this equation:
MEDIAN(abs(predictions - actuals))
I'd like to be able to calculate the Median Absolute Percent Error, calculated with this equation:
MEDIAN( abs(predictions - actuals) / actuals )
I thought I had it right with this:
from pyspark.sql import Window
from pyspark.sql import functions as f

def compute_Median_Perc_Err(df, expected_col, actual_col):
    grp_window = Window.partitionBy('grp')
    magic_percentile = f.expr('percentile_approx(abserror, 0.5)')
    med_perc_err = df.withColumn(
        "abserror", f.abs(f.col(actual_col) - f.col(expected_col))
    ).groupby('start_month', 'start_dt'
    ).agg(magic_percentile.alias("med_abs_error"),
          f.avg(f.col(actual_col)).alias("mean")
    ).withColumn("med_perc_error", f.col("med_abs_error") / f.col("mean"))
    return med_perc_err
But I realized that with this, I am not dividing by the actuals before taking the median. I should divide by the actuals first, and then take the median of that column.
How do I write this code snippet to divide by the actuals first, given that I still need to take .agg(f.avg(f.col("actuals"))) after the groupby to get an accurate mean?
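Here's a rough sketch of what I think I'm after, moving the division inside withColumn so it happens row by row before the aggregation (the name compute_Median_Abs_Perc_Err and the "perc_error" column are just placeholders; I'm not sure this is the right or most efficient way to do it):

from pyspark.sql import functions as f

def compute_Median_Abs_Perc_Err(df, expected_col, actual_col):
    # divide by the actuals on each row, before any grouping
    med_abs_perc_err = df.withColumn(
        "perc_error",
        f.abs(f.col(actual_col) - f.col(expected_col)) / f.col(actual_col)
    ).groupby('start_month', 'start_dt'
    ).agg(f.expr('percentile_approx(perc_error, 0.5)').alias("med_abs_perc_error"),
          # the mean of the actuals can still be computed in the same .agg()
          f.avg(f.col(actual_col)).alias("mean_actual"))
    return med_abs_perc_err

Is something like this correct, or is there a better way to keep the .agg(f.avg(...)) for the mean while taking the median of the per-row percent errors?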