
I'm trying to use the scipy.optimize.minimize function on two columns of a PySpark DataFrame.

While passing the x0 parameter as an array to the pandas UDF, I get the following error:

TypeError: Invalid argument, not a string or column: [0.9  0.5  2.5  5.   0.33] of type <class 'numpy.ndarray'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
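
The error itself hints at a fix for direct calls: a pandas UDF can only be invoked with Column arguments, so plain Python or NumPy scalars have to be wrapped, e.g. with lit. A minimal, hypothetical sketch of that wrapping (assuming the neg_bin UDF and column names used below):

import pyspark.sql.functions as F

# each scalar is wrapped in a literal column, which the UDF then
# receives as a constant pandas Series
df_adol.select(neg_bin(F.col('Adolescent_a'), F.col('Adolescent_e'),
                       F.lit(0.9), F.lit(0.5)))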

This is the function I am trying to minimize:

def eb_func(theta, n, e):
    """
    Function to be minimized.

    :param theta: numpy.ndarray of the five parameters being optimized
    :param n: pandas.Series
    :param e: pandas.Series
    :return: float
    """
    print("Entering EB_Func")
    # product over a two-component mixture weighted by theta[4]
    res = np.prod(theta[4] * neg_bin(n, e, theta[0], theta[1])
                  + (1 - theta[4]) * neg_bin(n, e, theta[2], theta[3]))
    return res

This is my neg_bin function:

@pandas_udf('double', PandasUDFType.SCALAR)
def neg_bin(n, e, alpha, beta):
    """
    Negative binomial term, evaluated element-wise.

    :param n: pandas.Series
    :param e: pandas.Series
    :param alpha: float
    :param beta: float
    :return: pandas.Series of doubles
    """
    # compute the gamma-function ratio on the log scale, then exponentiate
    res_expo = gammaln(alpha + n) - gammaln(n + 1) - gammaln(alpha)
    res = np.exp(res_expo)
    res = res / (1 + beta / (e + 0.01)) ** n
    res = res / (1 + e / beta) ** alpha
    return res

These are my parameters:

x0 = np.array([0.9, 0.5, 2.5, 5, 0.33])
bounds = ([0.000001, 200], [0.000001, 200], [0.000001, 200], [0.000001, 200], [0.000001, 1])

This is where I am trying to call the scipy.optimize.minimize function:

# Define a function to call the minimize function
def RunMinimize(data):
    result = minimize(eb_func, x0,
                      args=(data.Adolescent_a, data.Adolescent_e),
                      method='L-BFGS-B', bounds=bounds,
                      options={'disp': True, 'maxiter': 1000,
                               'eps': np.repeat(1e-4, 5)})
    return result.x


RunMinimize(df_adol)

I am new to PySpark. I can do this in pandas, but now I have a huge dataset and pandas takes a long time to process it.

This is the expected output format; it is what I get as output in pandas:

[1.00000000e-06, 1.46304225e+00, 1.00000000e-06, 6.39066185e+00, 1.00000000e-06]

I am having trouble passing the theta values to the neg_bin function, because neg_bin only accepts pandas.Series as input. I am looking for a workaround to pass the theta values as scalars alongside the pandas.Series inputs, if possible.

Any help is appreciated. TIA.


1 Answer


I tried to follow your example, but unfortunately not all functions are defined and no import statements are included. Hence, I provide a simpler example below (convert temperatures to/from C and F).

The idea is to wrap the pandas UDF in another function that takes the necessary scalar arguments. The example runs in pyspark 3.2+ and some adaptation may be needed for earlier versions.

import pandas as pd
import pyspark.sql.functions as F
import pyspark.sql.types as T

df = [[1. , 1.1], [2., 2.1]]
df = spark.createDataFrame(df, schema = ['x', 'y'])

def temp_to_temp(from_temp: str, to_temp: str):
    @F.pandas_udf(T.DoubleType())
    def temp_to_temp_inner(value: pd.Series) -> pd.Series:
        if to_temp == 'C':
            if from_temp == 'F':
                return (value - 32)*5./9
            else:
                return value
        elif to_temp == 'F':
            if from_temp == 'C':
                return value*9./5 + 32
            else:
                return value
    return temp_to_temp_inner

res = df.select(temp_to_temp('C', 'F')(F.col('x')).alias('temp (F)'))

res.show()
# +--------+
# |temp (F)|
# +--------+
# |    33.8|
# |    35.6|
# +--------+

where spark is the SparkSession.
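
Applying the same closure pattern to the neg_bin from your question might look like the following sketch (untested, and assuming numpy and scipy.special.gammaln, which your original post relies on):

import pandas as pd
import numpy as np
import pyspark.sql.functions as F
import pyspark.sql.types as T
from scipy.special import gammaln

def neg_bin_with_theta(alpha: float, beta: float):
    # the scalars are captured by the closure; the UDF itself only
    # receives the two Series columns
    @F.pandas_udf(T.DoubleType())
    def neg_bin_inner(n: pd.Series, e: pd.Series) -> pd.Series:
        res_expo = gammaln(alpha + n) - gammaln(n + 1) - gammaln(alpha)
        res = np.exp(res_expo)
        res = res / (1 + beta / (e + 0.01)) ** n
        res = res / (1 + e / beta) ** alpha
        return res
    return neg_bin_inner

df_adol.select(neg_bin_with_theta(0.9, 0.5)(F.col('Adolescent_a'),
                                            F.col('Adolescent_e')))

Note that each distinct set of scalars builds a new UDF, so every evaluation of eb_func inside scipy.optimize.minimize would trigger a separate Spark job.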
