Specifying reference category with 'statsmodels.formula.api' glm for dependent variable

Question

I read this link and tried to change the reference category for the dependent variable when calling glm from statsmodels.formula.api: glm(formula = "C(y,Treatment(reference=-1)) ~ x1 + x2", data=dta, family=sm.families.Binomial()).

The dependent variable can only take 2 values, y = {-1, 1}. I specified the reference category as above and even tried changing it from -1 to 1, yet the signs of the logistic regression coefficients stay the same. What did I do wrong here?

It's also confusing that the logistic regression output does not say whether an increase in x1 lowers the probability of -1 or of 1. Can someone help me out here, please?

                                                          Generalized Linear Model Regression Results                                                          
===============================================================================================================================================================
Dep. Variable:     ["C(y, Treatment(reference=-1))[-1.0]", "C(y, Treatment(reference=-1))[1.0]"]   No. Observations:                 3311
Model:                                                                                                             GLM   Df Residuals:                     3309
Model Family:                                                                                                 Binomial   Df Model:                            1
Link Function:                                                                                                   logit   Scale:                          1.0000
Method:                                                                                                           IRLS   Log-Likelihood:                -2292.4
Date:                                                                                                 Wed, 17 Nov 2021   Deviance:                       4584.8
Time:                                                                                                         22:51:58   Pearson chi2:                 3.31e+03
No. Iterations:                                                                                                      4                                         
Covariance Type:                                                                                             nonrobust                                         
====================================================================================================
           coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------------------
x1      -0.1769      0.120     -1.473      0.141      -0.412       0.058
x2       0.2388      0.110      2.164      0.030       0.022       0.455
====================================================================================================

score 1 · Answer 1 · answered Nov 20 '21 at 14:49

The link you cited is about setting the contrast for an independent variable. You are trying to change the reference for your dependent variable. You can either binarize it right from the start, or set the categories:

import statsmodels.formula.api as smf
import statsmodels.api as sm
import numpy as np
import pandas as pd

df = pd.DataFrame({'Treatment':np.random.choice([-1,1],50),
'x1':np.random.normal(0,1,50),
'x2':np.random.uniform(0,1,50)})

df['Treatment'] = pd.Categorical(df['Treatment'])

Here we can see that -1 comes before 1:

df['Treatment'].cat.categories
Int64Index([-1, 1], dtype='int64')

mdl = smf.glm(formula = "Treatment ~ x1 + x2",
data=df, family=sm.families.Binomial())
res = mdl.fit()
res.params
 
Intercept   -0.652064
x1          -0.184368
x2           1.280864

Now flip and you'll see your coefficients flip sign:

df['Treatment'] = pd.Categorical(df['Treatment'],categories = [1,-1])

df['Treatment'].cat.categories
Int64Index([1, -1], dtype='int64')

mdl = smf.glm(formula = "Treatment ~ x1 + x2",
data=df, family=sm.families.Binomial())
res = mdl.fit()
res.params
 
Intercept    0.652064
x1           0.184368
x2          -1.280864

is this related to the question at all? – StupidWolf, Nov 20 '21 at 18:40
yes, this is definitely relevant to the question (imo) – davedgd, Dec 13 '21 at 03:03
