This is an extension of an older SO question, but for Python rather than R. I also don't think the solution given there is the best one.
Suppose I have data that looks like this...
State  Y
AL     5
AK     10
AZ     8
I want to write a patsy formula to convert State to Region and then use statsmodels to make a prediction using Region. So the table would look like...
State  Region     Y
AL     Southeast  5
AK     Northwest  10
AZ     Southwest  8
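I'd like to be able to write something along the lines of the following (where smf is statsmodels.formula.api, df is the data above, and StateToRegionGrouping is the hypothetical piece I'm asking about):
model = smf.ols('Y ~ C(State, StateToRegionGrouping)', data=df).fit()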
I think there are two approaches: add a lookup column to the original data, or write a categorical transformer function for patsy to handle.
Which way is better and, if the patsy categorical transformer is better, what's a good way to program it?
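For concreteness, here is a minimal sketch of the two approaches as I understand them. It relies on patsy evaluating ordinary Python function calls inside a formula. The names STATE_TO_REGION and state_to_region are my own placeholders, not part of patsy or statsmodels, and the mapping only covers the three states above.

```python
import pandas as pd
import patsy

# Hypothetical lookup table; it would need an entry for every state in the data.
STATE_TO_REGION = {"AL": "Southeast", "AK": "Northwest", "AZ": "Southwest"}


def state_to_region(states):
    """Map an array-like of state codes to region names (placeholder helper)."""
    return pd.Series(states).map(STATE_TO_REGION).to_numpy()


df = pd.DataFrame({"State": ["AL", "AK", "AZ"], "Y": [5, 10, 8]})

# Approach 1: add the lookup column to the data, then use it in the formula.
df["Region"] = df["State"].map(STATE_TO_REGION)
y1, X1 = patsy.dmatrices("Y ~ C(Region)", data=df, return_type="dataframe")

# Approach 2: leave the data alone and call the mapping inside the formula.
# patsy looks up state_to_region in the environment that calls dmatrices.
y2, X2 = patsy.dmatrices(
    "Y ~ C(state_to_region(State))", data=df, return_type="dataframe"
)

print(X1)
print(X2)
```

Both calls should build the same design matrix, differing only in column labels. The same formula strings can be passed to smf.ols(formula, data=df). With the second form, I would expect predicting on new data to need only a State column, since the mapping is part of the formula, but I have not confirmed that this is the recommended pattern.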