I'm trying to use the scipy.optimize.minimize function on two columns of a PySpark DataFrame. When I pass the x0 parameter as an array to the Pandas UDF, I get the following error:
TypeError: Invalid argument, not a string or column: [0.9 0.5 2.5 5. 0.33] of type <class 'numpy.ndarray'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
This is the function I am trying to minimize:
def eb_func(theta, n, e):
    """
    Function to be minimized.
    :param theta: array of 5 floats (alpha1, beta1, alpha2, beta2, mixing weight)
    :param n: Pandas.Series
    :param e: Pandas.Series
    :return: float
    """
    print("Entering EB_Func")
    res = np.prod(theta[4] * neg_bin(n, e, theta[0], theta[1]) + (1 - theta[4]) * neg_bin(n, e, theta[2], theta[3]))
    return res
This is my neg_bin function:
@pandas_udf('double', PandasUDFType.SCALAR)
def neg_bin(n, e, alpha, beta):
    """
    :param n:
    :param e:
    :param alpha:
    :param beta:
    :return:
    """
    res_expo = gammaln(alpha + n) - gammaln(n + 1) - gammaln(alpha)
    res = np.exp(res_expo)
    res = res / (1 + beta / (e + 0.01)) ** n
    res = res / (1 + e / beta) ** alpha
    return res
These are my parameters:
x0 = np.array([0.9, 0.5, 2.5, 5, 0.33])
bounds = ([0.000001, 200], [0.000001, 200], [0.000001, 200], [0.000001, 200], [0.000001, 1])
This is where I call scipy.optimize.minimize:
# Define a function to call the minimize function
def RunMinimize(data):
    Result = minimize(eb_func, x0, args=(data.Adolescent_a, data.Adolescent_e),
                      method='L-BFGS-B', bounds=bounds,
                      options={'disp': True, 'maxiter': 1000, 'eps': np.repeat(1e-4, 5)})
    return Result.x

RunMinimize(df_adol)
I am new to PySpark. I can do this in pandas, but I now have a huge dataset and pandas takes a long time to process it.
This is the expected output format (it is what I get as output in pandas):
[1.00000000e-06, 1.46304225e+00, 1.00000000e-06, 6.39066185e+00, 1.00000000e-06]
I am having trouble passing the theta values to the neg_bin function, because as a Pandas UDF it only accepts columns (which arrive as pandas.Series) as input. I am looking for a workaround that lets me pass the theta values as scalars alongside the pandas.Series inputs to neg_bin, if that is possible.
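To make clear what I am after, here is a minimal sketch of the behaviour I want, assuming the two columns are available outside Spark as NumPy arrays (a pandas.Series would behave the same way). It is the same formula as neg_bin, written as a plain function without the pandas_udf decorator, so the theta values pass through as ordinary scalars. The names neg_bin_np and eb_func_np and the toy data are hypothetical and only for illustration; this does not solve the distributed case.

```python
import numpy as np
from scipy.special import gammaln


def neg_bin_np(n, e, alpha, beta):
    """Same formula as neg_bin, but a plain function: alpha and beta are scalars."""
    n = np.asarray(n, dtype=float)
    e = np.asarray(e, dtype=float)
    res = np.exp(gammaln(alpha + n) - gammaln(n + 1) - gammaln(alpha))
    res = res / (1 + beta / (e + 0.01)) ** n
    res = res / (1 + e / beta) ** alpha
    return res


def eb_func_np(theta, n, e):
    """Two-component mixture of neg_bin_np terms, multiplied over all rows."""
    mix = (theta[4] * neg_bin_np(n, e, theta[0], theta[1])
           + (1 - theta[4]) * neg_bin_np(n, e, theta[2], theta[3]))
    return float(np.prod(mix))


# Hypothetical toy data standing in for the Adolescent_a / Adolescent_e columns
n_toy = np.array([0.0, 1.0, 3.0])
e_toy = np.array([0.5, 1.2, 2.0])
x0 = np.array([0.9, 0.5, 2.5, 5, 0.33])

value = eb_func_np(x0, n_toy, e_toy)
print(value)
```

Written this way, eb_func_np has the signature scipy.optimize.minimize expects (theta first, data through args), which is what works for me in pandas.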
Any help is appreciated. TIA.