Is there a way to set the largest subset as the base in statsmodels glm, and how do you get hold of the base from the params?

Question

I have two questions on statsmodels GLM:

Is there a way to tell the glm to automatically set the level with most observations as the base (i.e. parameter 0) for each factor? If not, is there a reason for this?
Is there a way to display or extract the names of the base levels (i.e. the level with param = 0) from the GLM? I know the predict function works fine, but I am extracting the GLM output to use it elsewhere and would love to automate this.

I know the workaround that I can use Treatment in the formula e.g. instead of formula='y~C(x)' I can write formula='y~C(x, Treatment("abc"))'. I am using this for question 2 currently, and I suppose I could extend it to question 1 if I chase the data and the formula through a function to enhance the formula, but was wondering if there is some cleaner way to do this or a feature in the pipeline to be able to do this?

Cheers SO

StupidWolf · Answer 1 · 2021-02-23T08:17:08.457

It's better to do this by categorizing the variable before feeding it into the glm. This can be achieved by using pd.Categorial , for example using a simulated dataset:

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

np.random.seed(123)
df = pd.DataFrame({'y':np.random.uniform(0,1,100),
'x':np.random.choice(['a','b','c','d'],100)})

Here d would be the reference level since it has most observations:

df.x.value_counts()
 
d    28
b    27
c    26
a    19

If the subsequent order after the reference is not important, you can simply do:

df['x'] = pd.Categorical(df['x'],df.x.value_counts().index)

The reference level is simply:

df.x.cat.categories[0]
'd'

Regression on this:

model = smf.glm(formula = 'y ~ x',data=df).fit()

And you can see the reference is d:

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                  100
Model:                            GLM   Df Residuals:                       96
Model Family:                Gaussian   Df Model:                            3
Link Function:               identity   Scale:                        0.059173
Method:                          IRLS   Log-Likelihood:                 1.5121
Date:                Tue, 23 Feb 2021   Deviance:                       5.6806
Time:                        09:16:31   Pearson chi2:                     5.68
No. Iterations:                     3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.5108      0.046     11.111      0.000       0.421       0.601
x[T.b]        -0.0953      0.066     -1.452      0.146      -0.224       0.033
x[T.c]         0.0633      0.066      0.956      0.339      -0.067       0.193
x[T.a]        -0.0005      0.072     -0.007      0.994      -0.142       0.141
==============================================================================

Another option is to use the treatment you have pointed to, so the first task is to get the top level:

np.random.seed(123)
df = pd.DataFrame({'y':np.random.uniform(0,1,100),
'x':np.random.choice(['a','b','c','d'],100)})

ref = df.x.describe().top

from patsy.contrasts import Treatment
contrast = Treatment(reference=ref)
mod = smf.glm("y ~ C(x, Treatment)", data=df).fit()

Thanks for checking this out mate! The categorisation tip is excellent, I don't need the "C" in the formula any more then. However, I see 'a' has become the base (it is the only one that does not show up in the summary), this is the confusing thing about how this glm works. In the meanwhile I have set up a little function for my workaround, I will put it in a separate answer below — Rob, Feb 23 '21 at 08:01
i pasted the wrong output. if you run the code with categorization, it should give you the intended output. does it answer your question? — StupidWolf, Feb 23 '21 at 08:18
note no one can run your code or use it without a reproducible example — StupidWolf, Feb 23 '21 at 08:18

score 0 · Answer 2 · answered Feb 23 '21 at 08:05

For anyone who might be interested, I implemented the workaround that I mentioned above with the following function. It works fine, however it has the disadvantage that you get into trouble if you have brackets () [] in the levels. You can avoid this trouble by just handing over the treatment already, but not perfect.

def add_treatment_to_formula(formula:str, df:pd.DataFrame, exposure:str):
""" Little helper to add the Treatment field (this is the statsmodel terminology for the base level,
 i.e. the level that gets the parameter=0 (factor=1)) to GLM formulas. It sets it to the level with the
 largest exposure found in the df handed over.

:param formula: Example: 'claimamount ~ age + C(postcode) + C(familystatus, Treatment(2))' will get turned into
'claimamount ~ age + C(postcode, Treatment("12435")) + C(familystatus, Treatment(2))'. The familystatus already had
a treatment, the age is not categorical in this example, so only the postcode gets transformed.
:type formula: str
:param df: DataFrame with at least the columns that show up in the right side of the formula (age, postcode,
familystatus in the example) and the column containing the exposure as named in the exposure argument
:type df: pd.DataFrame()
:param exposure: Name of the column in the df containing the exposure
:type exposure: str
:return: Formula with treatments added
:rtype: str
"""

l_yx = formula.split('~')
if len(l_yx)>2:
    log.error("This does not look like a formula, more than 1 ~ found.")
l_xs = l_yx[1].split('+')
l_xs_enh = []
for x in l_xs:
    # if the treatment field is already set up don't change it. Also, if the field is not
    # categorical, don't change it
    if ('Treatment' in x) | (not 'C(' in x):
        l_xs_enh.append(x)
    else: # get field with largest exposure here and set it as treatment field
        field = x[x.find('(') + 1:x.find(')')].strip()
        df_exposure = df.groupby(field)[exposure].sum().reset_index()
        treatment = df_exposure.loc[df_exposure[exposure]==max(df_exposure[exposure]), field].values[0]
        if isinstance(treatment, str):
            quotes = '"'
        else:
            quotes=''

        x_enh = f'C({field}, Treatment({quotes}{treatment}{quotes}))'
        l_xs_enh.append(x_enh)

formula_enhanced = l_yx[0] + '~ ' + ' + '.join(l_xs_enh)

return formula_enhanced

Is there a way to set the largest subset as the base in statsmodels glm, and how do you get hold of the base from the params?

2 Answers2