Using GLM to reproduce built-in regression models in statsmodels

Question

I am currently trying to reproduce a regression model eq. (3) (edit: fixed link) in python using statsmodels. As this model is no part of the standard models provided by statsmodels I clearly have to write it myself using the provided formula api.

Since I have never worked with the formula api (or patsy for that matter), I wanted to start and verify my approach by reproducing standard models with the formula api and a generalized linear model. My code and the results for a poisson regression are given below at the end of my question.

You will see that it predicts the parameters beta = (2, -3, 1) for all three models with good accuracy. However, I have a couple of questions:

How do I explicitly add covariates to the glm model with a regression coefficient equal to 1?

From what I understand, a poisson regression in general has the shape ln(counts) = exp(intercept + beta * x + log(exposure)), i.e. the exposure is added through a fixed constant of value 1. I would like to reproduce this behaviour in my glm model, i.e. I want something like ln(counts) = exp(intercept + beta * x + k * log(exposure)) where k is a fixed constant as a formula.

Simply using formula = "1 + x1 + x2 + x3 + np.log(exposure)" returns a perfect seperation error (why?). I can bypass that by adding some random noise to y, but in that case np.log(exposure) has a non-unity regression coefficient, i.e. it is treated as a normal regression covariate.

Apparently both built-in models 1 and 2 have no intercept, eventhough I tried to explicitly add one in model 2. Or is there a hidden intercept that is simply not reported? In either case, how do I fix that?

Any help would be greatly appreciated, so thanks in advance!

import numpy as np
import pandas as pd

np.random.seed(1+8+2022)

# Number of random samples
n = 4000

# Intercept
alpha = 1

# Regression Coefficients
beta = np.asarray([2.0,-3.0,1.0])

# Random Data
data =  {
            "x1" :  np.random.normal(1.00,0.10, size = n),
            "x2" :  np.random.normal(1.50,0.15, size = n),
            "x3" :  np.random.normal(-2.0,0.20, size = n),
            "exposure":  np.random.poisson(14, size = n),
        }

# Calculate the response
x = np.asarray([data["x1"], data["x2"] , data["x3"]]).T
t = np.asarray(data["exposure"])

# Add response to random data
data["y"] = np.exp(alpha + np.dot(x,beta) + np.log(t))

# Convert dict to df
data = pd.DataFrame(data)

print(data)

#-----------------------------------------------------
#   My Model
#-----------------------------------------------------

import statsmodels.api as sm
import statsmodels.formula.api as smf

formula = "y ~ x1 + x2 + x3"
model = smf.glm(formula=formula, data=data, family=sm.families.Poisson()).fit()

print(model.summary())

#-----------------------------------------------------
#   statsmodels.discrete.discrete_model.Poisson 1
#-----------------------------------------------------

import statsmodels.api as sm

data["offset"] = np.ones(n)

model = sm.Poisson( endog = data["y"],
                    exog = data[["x1", "x2", "x3"]],
                    exposure = data["exposure"],
                    offset = data["offset"]).fit()

print(model.summary())

#-----------------------------------------------------
#   statsmodels.discrete.discrete_model.Poisson 2
#-----------------------------------------------------

import statsmodels.api as sm

data["x1"] = sm.add_constant(data["x1"])

model = sm.Poisson( endog = data["y"],
                    exog = data[["x1", "x2", "x3"]],
                    exposure = data["exposure"]).fit()

print(model.summary())

RESULTS:

            x1        x2        x3  exposure         y
0     1.151771  1.577677 -1.811903        13  0.508422
1     0.897012  1.678311 -2.327583        22  0.228219
2     1.040250  1.471962 -1.705458        13  0.621328
3     0.866195  1.512472 -1.766108        17  0.478107
4     0.925470  1.399320 -1.886349        13  0.512518
...        ...       ...       ...       ...       ...
3995  1.073945  1.365260 -1.755071        12  0.804081
3996  0.855000  1.251951 -2.173843        11  0.439639
3997  0.892066  1.710856 -2.183085        10  0.107643
3998  0.763777  1.538938 -2.013619        22  0.363551
3999  1.056958  1.413922 -1.722252        19  1.098932

[4000 rows x 5 columns]
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 4000
Model:                            GLM   Df Residuals:                     3996
Model Family:                 Poisson   Df Model:                            3
Link Function:                    log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -2743.7
Date:                Sat, 08 Jan 2022   Deviance:                       141.11
Time:                        09:32:32   Pearson chi2:                     140.
No. Iterations:                     4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.6857      0.378      9.755      0.000       2.945       4.426
x1             2.0020      0.227      8.800      0.000       1.556       2.448
x2            -3.0393      0.148    -20.604      0.000      -3.328      -2.750
x3             0.9937      0.114      8.719      0.000       0.770       1.217
==============================================================================
Optimization terminated successfully.
         Current function value: 0.668293
         Iterations 10
                          Poisson Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 4000
Model:                        Poisson   Df Residuals:                     3997
Method:                           MLE   Df Model:                            2
Date:                Sat, 08 Jan 2022   Pseudo R-squ.:                 0.09462
Time:                        09:32:32   Log-Likelihood:                -2673.2
converged:                       True   LL-Null:                       -2952.6
Covariance Type:            nonrobust   LLR p-value:                4.619e-122
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1             2.0000      0.184     10.875      0.000       1.640       2.360
x2            -3.0000      0.124    -24.160      0.000      -3.243      -2.757
x3             1.0000      0.094     10.667      0.000       0.816       1.184
==============================================================================
Optimization terminated successfully.
         Current function value: 0.677893
         Iterations 5
                          Poisson Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 4000
Model:                        Poisson   Df Residuals:                     3997
Method:                           MLE   Df Model:                            2
Date:                Sat, 08 Jan 2022   Pseudo R-squ.:                 0.08162
Time:                        09:32:32   Log-Likelihood:                -2711.6
converged:                       True   LL-Null:                       -2952.6
Covariance Type:            nonrobust   LLR p-value:                2.196e-105
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1             2.9516      0.304      9.711      0.000       2.356       3.547
x2            -2.9801      0.147    -20.275      0.000      -3.268      -2.692
x3             0.9807      0.113      8.655      0.000       0.759       1.203
==============================================================================

Process return 0 (0x0)
Press Enter to continue...

`exog = sm.add_constant(data[["x1", "x2", "x3"]])` exog with be a dataframe with an extra constant column — Josef, Jan 08 '22 at 18:10
main problem: your `y` are not counts and have no random component which will problems like perfect separation/prediction. You can use np.random.poisson to generate simulated Poisson count data. — Josef, Jan 08 '22 at 18:14
@Josef Thank you for the comment, that definitively helps. For some reason I didn't even think about adding the constant to the entire exog dataframe. Do you, by any chance, also have an answer to my first question regarding parameters with fixed coefficient? — ICST, Jan 08 '22 at 18:31
Your usage of offset and exposure looks correct to me. The `perfect seperation error` most likely picks up perfect prediction in this case because your y is deterministic and can be perfectly matched with the model. — Josef, Jan 08 '22 at 20:12
When you manage to work it out, you could provide your own answer with the corrected version. — Josef, Jan 08 '22 at 20:14
@Josef You're right in that my model had values between 0 and 1, i.e. not count data. When fixing this oversight (e.g. by setting`alpha = 5` and `data["y"] = np.exp(alpha + np.dot(x,beta) + np.log(t)).astype(int)`) I no longer get an `perfect seperation error`. However, adding `formula = "y ~ 1 + x1 + x2 + x3 + np.log(exposure)"` now yields `np.log(exposure) 1.0214 0.012 86.486 0.0 0.998 1.045` under `model.summary()`. This is not precisely what I want, as I want to force the regression coefficient to 1 for the exposure term. This is the only thing left for me :) — ICST, Jan 08 '22 at 20:44
use `mu=np.exp(alpha + np.dot(x,beta) + np.log(t))` and `np.random.poisson(mu, size=n)`. Exposure is usually assumed to be given by the sampling not random, e.g. total amount of time of exposure of an individual, exposure can be continuous. — Josef, Jan 08 '22 at 21:32
@Josef Hello Josef, these are all good advices and I adapted my code accordingly. This does however not seem to solve my original question as the regression coefficient for the exposure is still != 1? How do I reproduce the `sm.Poisson` model through `smf.glm`? I could also ask how to model a relationship of the kind `array(mu) = array(k) * exp(regression coefficients)` for an array `k`, i.e. there is a multiplicative constant k_i for each mu_i that is not subject to regression. Is such a model possible within the scope of `smf.glm`? — ICST, Jan 09 '22 at 07:47
You need to incorporate a constant different from one in the offset, e.g. `offset = k * log(exposure)`. However, `k` in array(mu) = array(k) * exp(regression coefficients)` is just the exposure, i.e. `Poisson(...., exposure=k)` with k varying across observations. — Josef, Jan 09 '22 at 14:42
GLM has the same Poisson model but you need to specify the `family`. — Josef, Jan 09 '22 at 14:43
@Josef: Thank you so much, this has solved my issues. When I find the time I will update my question with an answer, unfortunately I am a bit busy right now. — ICST, Jan 11 '22 at 18:03

Using GLM to reproduce built-in regression models in statsmodels

0 Answers0