This extends this older SO question, but for Python rather than R. I also don't think the solution given there is the best one.

Suppose I have data that looks like this...

State   Y
AL      5
AK      10
AZ      8

I want to write a patsy formula to convert State to Region and then use statsmodels to make a prediction using Region. So the table would look like...

State   Region    Y
AL      Southeast 5
AK      Northwest 10
AZ      Southwest 8

I'd like to have a function along the lines of

model = smf.ols('Y ~ C(State, StateToRegionGrouping)', data=df).fit()

I think there are two approaches: either add a lookup column to the original data, or write a categorical transformer function for patsy to handle the conversion.

Which way is better and, if the patsy categorical transformer is better, what's a good way to program it?

  • My guess is that doing this inside the patsy encoding would require a more complicated StateToRegionGrouping mapping and would be harder to read and understand. I would just use a list comprehension with a dict, or pandas, to map the states into a region factor, and then use patsy with region as a regular factor variable. – Josef Oct 09 '15 at 01:07

1 Answer

Keep it simple and just use a dictionary mapping:

import pandas as pd
import statsmodels.formula.api as smf

mapping = {'AL': 'Southeast',
           'AK': 'Northwest',
           'AZ': 'Southwest'}

df = pd.DataFrame({'State': ['AL', 'AK', 'AZ'], 'Y': [5, 10, 8]})
df['Region'] = df.State.map(mapping)

>>> df
  State   Y     Region
0    AL   5  Southeast
1    AK  10  Northwest
2    AZ   8  Southwest

model = smf.ols('Y ~ Region', data=df).fit()    
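
If you would rather keep the mapping inside the formula, patsy can call ordinary Python functions that are visible where the formula is evaluated. The following is a minimal sketch of that idea, not a recommendation over the lookup column. The names `state_to_region` and `STATE_TO_REGION` are hypothetical, and the extra states and Y values are made up so that each region has two observations. It also shows a prediction from new State values.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical lookup table; extend it to cover every state in your data.
STATE_TO_REGION = {'AL': 'Southeast', 'GA': 'Southeast',
                   'AK': 'Northwest', 'WA': 'Northwest',
                   'AZ': 'Southwest', 'NM': 'Southwest'}


def state_to_region(states):
    """Map state codes to region names (hypothetical helper)."""
    return pd.Series(states).map(STATE_TO_REGION)


# Made-up illustration data, two states per region.
df = pd.DataFrame({'State': ['AL', 'GA', 'AK', 'WA', 'AZ', 'NM'],
                   'Y': [5, 6, 10, 11, 8, 9]})

# patsy evaluates state_to_region(State) and C() treats the result
# as a categorical factor.
model = smf.ols('Y ~ C(state_to_region(State))', data=df).fit()

# New data only needs a State column; the formula redoes the mapping.
new_data = pd.DataFrame({'State': ['GA', 'WA']})
predictions = model.predict(new_data)
print(predictions)
```

Because the conversion lives in the formula, predict only needs a State column. The trade-off is that the formula now depends on a function defined elsewhere, which is the readability concern raised in the comment on the question.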
Alexander