I am currently trying to reproduce a regression model eq. (3) (edit: fixed link) in python using statsmodels. As this model is no part of the standard models provided by statsmodels I clearly have to write it myself using the provided formula api.
Since I have never worked with the formula api (or patsy for that matter), I wanted to start and verify my approach by reproducing standard models with the formula api and a generalized linear model. My code and the results for a poisson regression are given below at the end of my question.
You will see that it predicts the parameters beta = (2, -3, 1) for all three models with good accuracy. However, I have a couple of questions:
- How do I explicitly add covariates to the glm model with a regression coefficient equal to 1?
From what I understand, a poisson regression in general has the shape ln(counts) = exp(intercept + beta * x + log(exposure))
, i.e. the exposure is added through a fixed constant of value 1. I would like to reproduce this behaviour in my glm model, i.e. I want something like ln(counts) = exp(intercept + beta * x + k * log(exposure))
where k
is a fixed constant as a formula.
Simply using formula = "1 + x1 + x2 + x3 + np.log(exposure)"
returns a perfect seperation error (why?). I can bypass that by adding some random noise to y
, but in that case np.log(exposure)
has a non-unity regression coefficient, i.e. it is treated as a normal regression covariate.
- Apparently both built-in models 1 and 2 have no intercept, eventhough I tried to explicitly add one in model 2. Or is there a hidden intercept that is simply not reported? In either case, how do I fix that?
Any help would be greatly appreciated, so thanks in advance!
import numpy as np
import pandas as pd
np.random.seed(1+8+2022)
# Number of random samples
n = 4000
# Intercept
alpha = 1
# Regression Coefficients
beta = np.asarray([2.0,-3.0,1.0])
# Random Data
data = {
"x1" : np.random.normal(1.00,0.10, size = n),
"x2" : np.random.normal(1.50,0.15, size = n),
"x3" : np.random.normal(-2.0,0.20, size = n),
"exposure": np.random.poisson(14, size = n),
}
# Calculate the response
x = np.asarray([data["x1"], data["x2"] , data["x3"]]).T
t = np.asarray(data["exposure"])
# Add response to random data
data["y"] = np.exp(alpha + np.dot(x,beta) + np.log(t))
# Convert dict to df
data = pd.DataFrame(data)
print(data)
#-----------------------------------------------------
# My Model
#-----------------------------------------------------
import statsmodels.api as sm
import statsmodels.formula.api as smf
formula = "y ~ x1 + x2 + x3"
model = smf.glm(formula=formula, data=data, family=sm.families.Poisson()).fit()
print(model.summary())
#-----------------------------------------------------
# statsmodels.discrete.discrete_model.Poisson 1
#-----------------------------------------------------
import statsmodels.api as sm
data["offset"] = np.ones(n)
model = sm.Poisson( endog = data["y"],
exog = data[["x1", "x2", "x3"]],
exposure = data["exposure"],
offset = data["offset"]).fit()
print(model.summary())
#-----------------------------------------------------
# statsmodels.discrete.discrete_model.Poisson 2
#-----------------------------------------------------
import statsmodels.api as sm
data["x1"] = sm.add_constant(data["x1"])
model = sm.Poisson( endog = data["y"],
exog = data[["x1", "x2", "x3"]],
exposure = data["exposure"]).fit()
print(model.summary())
RESULTS:
x1 x2 x3 exposure y
0 1.151771 1.577677 -1.811903 13 0.508422
1 0.897012 1.678311 -2.327583 22 0.228219
2 1.040250 1.471962 -1.705458 13 0.621328
3 0.866195 1.512472 -1.766108 17 0.478107
4 0.925470 1.399320 -1.886349 13 0.512518
... ... ... ... ... ...
3995 1.073945 1.365260 -1.755071 12 0.804081
3996 0.855000 1.251951 -2.173843 11 0.439639
3997 0.892066 1.710856 -2.183085 10 0.107643
3998 0.763777 1.538938 -2.013619 22 0.363551
3999 1.056958 1.413922 -1.722252 19 1.098932
[4000 rows x 5 columns]
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 4000
Model: GLM Df Residuals: 3996
Model Family: Poisson Df Model: 3
Link Function: log Scale: 1.0000
Method: IRLS Log-Likelihood: -2743.7
Date: Sat, 08 Jan 2022 Deviance: 141.11
Time: 09:32:32 Pearson chi2: 140.
No. Iterations: 4
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 3.6857 0.378 9.755 0.000 2.945 4.426
x1 2.0020 0.227 8.800 0.000 1.556 2.448
x2 -3.0393 0.148 -20.604 0.000 -3.328 -2.750
x3 0.9937 0.114 8.719 0.000 0.770 1.217
==============================================================================
Optimization terminated successfully.
Current function value: 0.668293
Iterations 10
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 4000
Model: Poisson Df Residuals: 3997
Method: MLE Df Model: 2
Date: Sat, 08 Jan 2022 Pseudo R-squ.: 0.09462
Time: 09:32:32 Log-Likelihood: -2673.2
converged: True LL-Null: -2952.6
Covariance Type: nonrobust LLR p-value: 4.619e-122
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 2.0000 0.184 10.875 0.000 1.640 2.360
x2 -3.0000 0.124 -24.160 0.000 -3.243 -2.757
x3 1.0000 0.094 10.667 0.000 0.816 1.184
==============================================================================
Optimization terminated successfully.
Current function value: 0.677893
Iterations 5
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 4000
Model: Poisson Df Residuals: 3997
Method: MLE Df Model: 2
Date: Sat, 08 Jan 2022 Pseudo R-squ.: 0.08162
Time: 09:32:32 Log-Likelihood: -2711.6
converged: True LL-Null: -2952.6
Covariance Type: nonrobust LLR p-value: 2.196e-105
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 2.9516 0.304 9.711 0.000 2.356 3.547
x2 -2.9801 0.147 -20.275 0.000 -3.268 -2.692
x3 0.9807 0.113 8.655 0.000 0.759 1.203
==============================================================================
Process return 0 (0x0)
Press Enter to continue...