
I'm trying to use the scipy.optimize.minimize function on two columns of a PySpark DataFrame.

While passing the x0 parameter as an array to the pandas UDF, I get the following error:

TypeError: Invalid argument, not a string or column: [0.9  0.5  2.5  5.   0.33] of type <class 'numpy.ndarray'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
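
The error itself hints at a fix for direct calls: a pandas UDF can only be invoked with Column arguments, so plain Python or NumPy scalars have to be wrapped, e.g. with lit. A minimal, hypothetical sketch of that wrapping (assuming the neg_bin UDF and column names used below):

import pyspark.sql.functions as F

# each scalar is wrapped in a literal column, which the UDF then
# receives as a constant pandas Series
df_adol.select(neg_bin(F.col('Adolescent_a'), F.col('Adolescent_e'),
                       F.lit(0.9), F.lit(0.5)))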

This is the function I am trying to minimize:

def eb_func(theta, n, e):
    """
    Function to be minimized.

    :param theta: numpy.ndarray of the five parameters being optimized
    :param n: pandas.Series
    :param e: pandas.Series
    :return: float
    """
    print("Entering EB_Func")
    # product over a two-component mixture weighted by theta[4]
    res = np.prod(theta[4] * neg_bin(n, e, theta[0], theta[1])
                  + (1 - theta[4]) * neg_bin(n, e, theta[2], theta[3]))
    return res

This is my neg_bin function:

@pandas_udf('double', PandasUDFType.SCALAR)
def neg_bin(n, e, alpha, beta):
    """
    Negative binomial term, evaluated element-wise.

    :param n: pandas.Series
    :param e: pandas.Series
    :param alpha: float
    :param beta: float
    :return: pandas.Series of doubles
    """
    # compute the gamma-function ratio on the log scale, then exponentiate
    res_expo = gammaln(alpha + n) - gammaln(n + 1) - gammaln(alpha)
    res = np.exp(res_expo)
    res = res / (1 + beta / (e + 0.01)) ** n
    res = res / (1 + e / beta) ** alpha
    return res

These are my parameters:

x0 = np.array([0.9, 0.5, 2.5, 5, 0.33])
bounds = ([0.000001, 200], [0.000001, 200], [0.000001, 200], [0.000001, 200], [0.000001, 1])

This is where I am trying to call the scipy.optimize.minimize function:

# Define a function to call the minimize function
def RunMinimize(data):
    result = minimize(eb_func, x0,
                      args=(data.Adolescent_a, data.Adolescent_e),
                      method='L-BFGS-B', bounds=bounds,
                      options={'disp': True, 'maxiter': 1000,
                               'eps': np.repeat(1e-4, 5)})
    return result.x


RunMinimize(df_adol)

I am new to PySpark. I can do this in pandas, but now I have a huge dataset and pandas takes a long time to process it.

This is the expected output format; it is what I get as output in pandas:

[1.00000000e-06, 1.46304225e+00, 1.00000000e-06, 6.39066185e+00, 1.00000000e-06]

I am having trouble passing the theta values to the neg_bin function, because neg_bin only accepts pandas.Series as input. I am looking for a workaround to pass the theta values as scalars alongside the pandas.Series inputs, if possible.

Any help is appreciated. TIA.


1 Answer


I tried to follow your example, but unfortunately not all functions are defined and no import statements are included. Hence, I provide a simpler example below (convert temperatures to/from C and F).

The idea is to wrap the pandas UDF in another function that takes the necessary scalar arguments. The example runs in pyspark 3.2+ and some adaptation may be needed for earlier versions.

import pandas as pd
import pyspark.sql.functions as F
import pyspark.sql.types as T

df = [[1. , 1.1], [2., 2.1]]
df = spark.createDataFrame(df, schema = ['x', 'y'])

def temp_to_temp(from_temp: str, to_temp: str):
    @F.pandas_udf(T.DoubleType())
    def temp_to_temp_inner(value: pd.Series) -> pd.Series:
        if to_temp == 'C':
            if from_temp == 'F':
                return (value - 32)*5./9
            else:
                return value
        elif to_temp == 'F':
            if from_temp == 'C':
                return value*9./5 + 32
            else:
                return value
    return temp_to_temp_inner

res = df.select(temp_to_temp('C', 'F')(F.col('x')).alias('temp (F)'))

res.show()
# +--------+
# |temp (F)|
# +--------+
# |    33.8|
# |    35.6|
# +--------+

where spark is the SparkSession.
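
Applying the same closure pattern to the neg_bin from your question might look like the following sketch (untested, and assuming numpy and scipy.special.gammaln, which your original post relies on):

import pandas as pd
import numpy as np
import pyspark.sql.functions as F
import pyspark.sql.types as T
from scipy.special import gammaln

def neg_bin_with_theta(alpha: float, beta: float):
    # the scalars are captured by the closure; the UDF itself only
    # receives the two Series columns
    @F.pandas_udf(T.DoubleType())
    def neg_bin_inner(n: pd.Series, e: pd.Series) -> pd.Series:
        res_expo = gammaln(alpha + n) - gammaln(n + 1) - gammaln(alpha)
        res = np.exp(res_expo)
        res = res / (1 + beta / (e + 0.01)) ** n
        res = res / (1 + e / beta) ** alpha
        return res
    return neg_bin_inner

df_adol.select(neg_bin_with_theta(0.9, 0.5)(F.col('Adolescent_a'),
                                            F.col('Adolescent_e')))

Note that each distinct set of scalars builds a new UDF, so every evaluation of eb_func inside scipy.optimize.minimize would trigger a separate Spark job.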
