It's better to do this by categorizing the variable before feeding it into the GLM. This can be achieved with pd.Categorical, for example using a simulated dataset:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
np.random.seed(123)
df = pd.DataFrame({'y':np.random.uniform(0,1,100),
'x':np.random.choice(['a','b','c','d'],100)})
Here, d would be the reference level, since it has the most observations:
df.x.value_counts()
d 28
b 27
c 26
a 19
If the order of the remaining levels after the reference is not important, you can simply do:
df['x'] = pd.Categorical(df['x'],df.x.value_counts().index)
The reference level is simply:
df.x.cat.categories[0]
'd'
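To double-check the full category ordering, here is a minimal self-contained sketch that repeats the simulation above and prints the ordered levels (frequency order, most frequent first):

```python
import pandas as pd
import numpy as np

# Recreate the simulated data from above
np.random.seed(123)
df = pd.DataFrame({'y': np.random.uniform(0, 1, 100),
                   'x': np.random.choice(['a', 'b', 'c', 'd'], 100)})

# Order categories by frequency; the first category becomes the reference
df['x'] = pd.Categorical(df['x'], df.x.value_counts().index)
print(df.x.cat.categories.tolist())  # ['d', 'b', 'c', 'a']
```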
Regression on this:
model = smf.glm(formula = 'y ~ x',data=df).fit()
And you can see the reference is d: it is absorbed into the intercept, and no x[T.d] coefficient appears in the table:
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 100
Model: GLM Df Residuals: 96
Model Family: Gaussian Df Model: 3
Link Function: identity Scale: 0.059173
Method: IRLS Log-Likelihood: 1.5121
Date: Tue, 23 Feb 2021 Deviance: 5.6806
Time: 09:16:31 Pearson chi2: 5.68
No. Iterations: 3
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.5108 0.046 11.111 0.000 0.421 0.601
x[T.b] -0.0953 0.066 -1.452 0.146 -0.224 0.033
x[T.c] 0.0633 0.066 0.956 0.339 -0.067 0.193
x[T.a] -0.0005 0.072 -0.007 0.994 -0.142 0.141
==============================================================================
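Rather than reading the summary table, you can also confirm the reference programmatically; a small self-contained sketch refitting the model above:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Recreate the simulated data and the frequency-ordered categorical
np.random.seed(123)
df = pd.DataFrame({'y': np.random.uniform(0, 1, 100),
                   'x': np.random.choice(['a', 'b', 'c', 'd'], 100)})
df['x'] = pd.Categorical(df['x'], df.x.value_counts().index)

model = smf.glm(formula='y ~ x', data=df).fit()

# Dummy coding skips the reference level, so 'x[T.d]' never appears
# among the coefficient names:
print(model.params.index.tolist())
```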
Another option is to use the Treatment contrast you have pointed to, so the first task is to get the top (most frequent) level:
np.random.seed(123)
df = pd.DataFrame({'y':np.random.uniform(0,1,100),
'x':np.random.choice(['a','b','c','d'],100)})
ref = df.x.describe().top
from patsy.contrasts import Treatment
mod = smf.glm("y ~ C(x, Treatment(reference=ref))", data=df).fit()
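Passing the reference into Treatment and inspecting the parameter names again shows that d is the level left out; a minimal self-contained sketch:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from patsy.contrasts import Treatment

np.random.seed(123)
df = pd.DataFrame({'y': np.random.uniform(0, 1, 100),
                   'x': np.random.choice(['a', 'b', 'c', 'd'], 100)})

# Most frequent level, used as the treatment reference
ref = df.x.describe().top

mod = smf.glm("y ~ C(x, Treatment(reference=ref))", data=df).fit()

# No dummy is generated for the reference level 'd'
print(mod.params.index.tolist())
```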