My question is similar to (Python pandas: how to run multiple univariate regression by group). I have a set of regressions to run by group, but in my case the regression coefficients are bounded between 0 and 1, and there is the additional constraint that the coefficients must sum to 1. I tried to solve it as an optimization problem, first using the whole data frame (disregarding the groups):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'y0': np.random.randn(20),
    'y1': np.random.randn(20),
    'x0': np.random.randn(20),
    'x1': np.random.randn(20),
    'grpVar': ['a', 'b'] * 10})
def SumSqDif(a):
    return np.sum((df['y0'] - a[0]*df['x0'])**2 + (df['y1'] - a[1]*df['x1'])**2)
# Starting values
startVal = np.ones(2) * (1/2)

# Constraint: sum of coefficients = 1
cons = {'type': 'eq', 'fun': lambda x: 1 - sum(x)}

# Bounds on coefficients
bnds = tuple((0, 1) for x in startVal)
# Solve the optimization problem using the full data frame (disregarding groups)
from scipy.optimize import minimize

Result = minimize(SumSqDif, startVal, method='SLSQP', bounds=bnds, constraints=cons)
Result.x
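As a quick sanity check on this full-data fit (my own check, not part of the original approach), the solution should respect both the bounds and the equality constraint:

# Sanity check (my addition): coefficients stay in [0, 1] and sum to 1
assert np.all((Result.x >= 0) & (Result.x <= 1))
assert np.isclose(np.sum(Result.x), 1.0)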
Then I tried to use DataFrame groupby() and apply(). However, the error I get is TypeError: unhashable type: 'numpy.ndarray'.
# Try to solve the optimization problem by group
# Create the GroupBy object
grp_grpVar = df.groupby('grpVar')

def RunMinimize(data):
    ResultByGrp = minimize(SumSqDif, startVal, method='SLSQP', bounds=bnds, constraints=cons)
    return ResultByGrp.x

grp_grpVar.apply(RunMinimize(df))
This could probably be done by iterating over the groups in a loop (see the sketch below), but my actual data contains about 70 million groups, and I thought DataFrame groupby with apply() would be more efficient.
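For reference, this is roughly the loop I am trying to avoid (a sketch only; I rewrote the objective to take each group's rows as an argument via minimize's args=, since I assume closing over the global df is part of the problem):

# Sketch of the loop-over-groups version I would like to avoid.
# The objective takes the group's rows explicitly instead of using the global df
# (an assumption on my part about what the per-group fit should look like).
def SumSqDifGrp(a, data):
    return np.sum((data['y0'] - a[0]*data['x0'])**2
                  + (data['y1'] - a[1]*data['x1'])**2)

coefs = {}
for name, data in df.groupby('grpVar'):
    res = minimize(SumSqDifGrp, startVal, args=(data,),
                   method='SLSQP', bounds=bnds, constraints=cons)
    coefs[name] = res.x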
I am new to Python. I searched this and other sites but could not find any example combining DataFrame apply() with scipy.optimize.minimize.
Any ideas would be appreciated.