
I'm ultimately hoping to recover functionality similar to that described in SPARK-22239, which enables the use of Pandas user-defined functions with window functions.
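For concreteness, the kind of usage I have in mind is sketched below (assuming a Spark version in which SPARK-22239 and bounded-window support for pandas UDFs are available; ewm_udf is just an illustrative name): a grouped-aggregate pandas UDF applied directly over the window, with no intermediate array column.

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.window import Window


@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def ewm_udf(prices):
    """
    Receives the prices in each window as a pandas Series and returns a scalar
    exponential weighted mean.
    """
    return float(prices.ewm(alpha=0.5).mean().iloc[-1])

# apply the pandas UDF directly over the 30-minute range window
w = Window.orderBy('date').rangeBetween(-30 * 60, 0)
df = df.withColumn('price_ema_30mins', ewm_udf(df['price']).over(w))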

Specifically, I'm performing a timestamp-based windowing of underlying numerical observations, and then computing the exponential weighted mean for each window. I have a working approach, but am concerned that it may be inefficient, and that I may be overlooking a better-optimised solution.

I've approached this problem by using collect_list to obtain the array of numerical values corresponding to the correct time window for each row in the dataframe, and then applying a UDF to compute the exponential weighted mean for each array.

import pandas as pd

from pyspark.sql.functions import udf, collect_list
from pyspark.sql.types import DoubleType
from pyspark.sql.window import Window


def mins(t_mins):
    """
    Utility function converting time in mins to time in secs.
    """
    return 60 * t_mins


# collect the relevant set of prices for the moving average per row
# (assumes 'date' is a numeric unix timestamp in seconds, so the range offsets are in seconds)
w = Window.orderBy('date').rangeBetween(-mins(30), 0)
df = df.withColumn('windowed_price', collect_list('price').over(w))

# compute the exponential weighted mean from each array of prices
@udf(DoubleType())
def arr_to_ewm(arr):
    """
    Computes exponential weighted mean per row from array of relevant time points.
    """
    series = pd.Series(arr)
    ewm = series.ewm(alpha=0.5).mean().iloc[-1]
    # make sure return type is python primitive instead of Numpy dtype
    return float(ewm)
df = df.withColumn('price_ema_30mins', arr_to_ewm(df.windowed_price))

The above approach works, but my understanding is that both collect_list and the Python UDF are computationally expensive. Is there a more efficient way to perform this computation in PySpark?
