My question is similar to "Python pandas: how to run multiple univariate regression by group". I have a set of regressions to run by group, but in my case each regression coefficient is bounded between 0 and 1 and the coefficients must sum to 1. I tried to solve this as an optimization problem, first using the whole data frame (disregarding the groups).

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'y0': np.random.randn(20),
    'y1': np.random.randn(20),
    'x0': np.random.randn(20), 
    'x1': np.random.randn(20),
    'grpVar': ['a', 'b'] * 10})

def SumSqDif(a):
    return np.sum((df['y0'] - a[0]*df['x0'])**2 + (df['y1'] - a[1]*df['x1'])**2)

# Starting values
startVal = np.ones(2)*(1/2)

# Constraint: sum of coefficients = 1
cons = ({'type':'eq', 'fun': lambda x: 1 - sum(x)})

# Bounds on coefficients
bnds = tuple([0,1] for x in startVal)

# Solve the optimization problem using the full dataframe (disregarding groups)
from scipy.optimize import minimize
Result = minimize(SumSqDif, startVal , method='SLSQP' , bounds=bnds , constraints = cons )
Result.x

Then I tried to use data frame group by and apply(). However the error I get is

TypeError: unhashable type: 'numpy.ndarray'.

# Try to Solve the optimization problem By group
# Create GroupBy object
grp_grpVar = df.groupby('grpVar')

def RunMinimize(data):
    ResultByGrp = minimize(SumSqDif, startVal , method='SLSQP' , bounds=bnds , constraints = cons )
    return ResultByGrp.x

grp_grpVar.apply(RunMinimize(df))

This could probably be done by iterating in a loop, but my actual data contains about 70 million groups, and I thought that data frame groupby and apply() would be more efficient. I am new to Python. I searched this and other sites but could not find any example combining data frame apply() with scipy.optimize.minimize. Any ideas would be appreciated.
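
On the scale concern: groupby().apply() still calls the Python function once per group, so it is not expected to be much faster than an explicit loop. For the specific two-coefficient objective above, one possible shortcut (not part of the original approach, and only valid if the real problem has this same form) is to substitute a[1] = 1 - a[0]. That leaves a one-dimensional convex quadratic whose minimizer has a closed form, and clipping it to [0, 1] enforces the bounds. The sketch below uses seeded sample data and hypothetical names (terms, sums, coefs), and assumes every group has at least one non-zero x value.

```python
import numpy as np
import pandas as pd

# Seeded sample data with the same columns as in the question (assumption)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'y0': rng.standard_normal(20),
    'y1': rng.standard_normal(20),
    'x0': rng.standard_normal(20),
    'x1': rng.standard_normal(20),
    'grpVar': ['a', 'b'] * 10})

# Per-row products; their group sums are all the closed form needs
terms = pd.DataFrame({
    'x0y0': df['x0'] * df['y0'],
    'x1y1': df['x1'] * df['y1'],
    'x0x0': df['x0'] ** 2,
    'x1x1': df['x1'] ** 2,
    'grpVar': df['grpVar']})
sums = terms.groupby('grpVar').sum()

# Setting the derivative of
#   sum((y0 - a*x0)**2) + sum((y1 - (1 - a)*x1)**2)
# with respect to a to zero gives the unconstrained minimizer below;
# because the objective is a convex quadratic in a, clipping to [0, 1]
# yields the bounded optimum.
a0 = ((sums['x0y0'] - sums['x1y1'] + sums['x1x1'])
      / (sums['x0x0'] + sums['x1x1'])).clip(0, 1)

coefs = pd.DataFrame({'a0': a0, 'a1': 1 - a0})
print(coefs)
```

This replaces the per-group optimizer call with a single vectorized groupby sum. It does not generalize as written to more than two coefficients, where a constrained solver per group (or a batched quadratic-programming routine) would still be needed.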

Tonechas
Paul

1 Answer

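The error arises because `grp_grpVar.apply(RunMinimize(df))` calls RunMinimize right away and passes its return value (a NumPy array) to apply, which expects a function. Pass the function itself and let apply hand each group to it. I believe what you want is this: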

# add a df parameter to your `SumSqDif` function signature, so that when
# the function is applied to the grouped dataframe, each group gets passed
# as the df argument
def SumSqDif(a, df):
    return np.sum((df['y0'] - a[0]*df['x0'])**2 + (df['y1'] - a[1]*df['x1'])**2)

# add startVal, bnds, and cons as additional parameters.
# As originally written, the function read these values from the
# global namespace, which is not good practice because it assumes
# they exist in the global scope, which may not always be true
def RunMinimize(data, startVal, bnds, cons):
    # pass the group through `args` so that minimize forwards it
    # to SumSqDif as the df argument
    ResultByGrp = minimize(SumSqDif, startVal, method='SLSQP',
                           bounds=bnds, constraints=cons, args=(data,))
    return ResultByGrp.x

# Here, startVal, bnds, and cons are passed to RunMinimize as
# additional keyword arguments to `apply`
df.groupby('grpVar').apply(RunMinimize, startVal=startVal, bnds=bnds, cons=cons)
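
Putting the pieces together, a self-contained version might look like the sketch below. The seeded sample data, the explicit column selection before apply, and returning a pd.Series (so that the result comes back as one row per group with hypothetical column names a0 and a1) are my own additions for illustration, not something the question requires.

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize

# Seeded sample data with the same columns as in the question (assumption)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'y0': rng.standard_normal(20),
    'y1': rng.standard_normal(20),
    'x0': rng.standard_normal(20),
    'x1': rng.standard_normal(20),
    'grpVar': ['a', 'b'] * 10})

def SumSqDif(a, data):
    # Sum of squared residuals for the rows in `data` only
    return np.sum((data['y0'] - a[0]*data['x0'])**2
                  + (data['y1'] - a[1]*data['x1'])**2)

def RunMinimize(data, startVal, bnds, cons):
    # `args=(data,)` forwards the current group to SumSqDif
    res = minimize(SumSqDif, startVal, args=(data,), method='SLSQP',
                   bounds=bnds, constraints=cons)
    # Labelled output so apply() assembles a DataFrame
    return pd.Series(res.x, index=['a0', 'a1'])

startVal = np.full(2, 0.5)                              # starting values
bnds = [(0, 1), (0, 1)]                                 # 0 <= a[i] <= 1
cons = {'type': 'eq', 'fun': lambda a: 1 - np.sum(a)}   # sum(a) == 1

# Pass the function itself; apply() calls it once per group
coefs = (df.groupby('grpVar')[['y0', 'y1', 'x0', 'x1']]
           .apply(RunMinimize, startVal=startVal, bnds=bnds, cons=cons))
print(coefs)
```

Selecting the data columns before apply keeps the grouping column out of what RunMinimize receives, which the objective does not use anyway.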
Scratch'N'Purr
  • Great. Exactly what I needed. Thank you very much, Scratch'N'Purr. – Paul Jul 04 '17 at 17:17
  • No problem! Do you mind upvoting my question please? :D – Scratch'N'Purr Jul 05 '17 at 15:38
  • I tried to upvote it but I can't. There was a message that votes of those with reputation less than 15 are recorded but do not change the publicly displayed score. Sorry. – Paul Jul 06 '17 at 01:09