This is the head of my training data set (see the data screenshot). Running the code below:

import statsmodels.api as sm

logit = sm.GLM(Y_train, X_train, family=sm.families.Binomial())
result = logit.fit()

I am getting the error below (screenshot of a perfect separation error). Can you please help?
Python has detected complete or quasi-complete separation between one or more of your predictors and the outcome variable.
This happens when all or nearly all of the values in one of the predictor categories (or a combination of predictors) are associated with only one of the binary outcome values. (I'm assuming you're attempting a logistic regression.) When this happens, no finite maximum likelihood estimate exists for that predictor's coefficient, so the fit cannot converge.
There are several possible solutions. Depending on how many variables are in your analysis, you can try running two-way crosstabs of your outcome against each of the predictor variables to locate any cells with zero observations, and then drop that variable from the analysis or use fewer categories. Another option is to run a Firth logistic regression or another penalized regression.
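For example, here is a quick way to screen for empty cells with pandas (just a sketch; the train DataFrame, the 'outcome' column, and the predictor names are hypothetical):

import pandas as pd

# 'train' is a hypothetical DataFrame with a binary 'outcome' column and
# categorical predictors to screen for empty outcome-by-predictor cells.
predictors = ['var1', 'var2', 'var3']  # hypothetical predictor names
for col in predictors:
    tab = pd.crosstab(train[col], train['outcome'])
    if (tab == 0).any().any():
        print(f"Zero cell(s) for predictor '{col}':")
        print(tab)

Any predictor flagged this way is a candidate for dropping or for collapsing categories.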
Following up on RobertF's nice answer with some more info on running Firth logistic regression.
Firth logistic regression is an effective way to deal with separation, which is observed in your dataset as RobertF explained. See Heinze and Schemper, 2002 for more detail, or see this Medium article for a more casual explanation.
There are easy-to-use packages available in Python (firthlogist) and R (brglm2 and logistf). To demonstrate, I'll use the endometrial cancer dataset that was analyzed in the Heinze and Schemper 2002 paper. In short, there are three features, NV (binary), PI (continuous), and EH (continuous), and the target HG (binary). There is no observation where NV=1 and HG=0, which results in quasi-complete separation.
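You can see this directly with a quick crosstab in Python (a sketch; it assumes load_endometrial returns X as a NumPy array whose columns are ordered as in feature_names):

import pandas as pd
from firthlogist import load_endometrial

X, y, feature_names = load_endometrial()
nv = X[:, feature_names.index('NV')]  # the binary NV column (assumed column layout)
# Cross-tabulate NV against the target HG: the NV=1 / HG=0 cell is empty,
# which is the quasi-complete separation described above.
print(pd.crosstab(pd.Series(nv, name='NV'), pd.Series(y, name='HG')))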
Ioannis Kosmidis, the developer of brglm2, also created a package called detectseparation. The below code snippet is taken from the vignette, showing that there is separation in the data and that the maximum likelihood estimate for NV is infinity.
endo_sep <- glm(HG ~ NV + PI + EH, data = endometrial,
                family = binomial("logit"),
                method = "detect_separation")
endo_sep
#> Implementation: ROI | Solver: lpsolve
#> Separation: TRUE
#> Existence of maximum likelihood estimates
#> (Intercept) NV PI EH
#> 0 Inf 0 0
#> 0: finite value, Inf: infinity, -Inf: -infinity
Let's first try without penalization:
>>> from firthlogist import load_endometrial
>>> import statsmodels.api as sm
>>> X, y, feature_names = load_endometrial()
>>> log_reg = sm.Logit(y, X).fit()
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.350590
Iterations: 35
~/miniconda3/envs/firth/lib/python3.10/site-packages/statsmodels/base/model.py:604: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
warnings.warn("Maximum Likelihood optimization failed to "
>>> log_reg.summary(xname=['Intercept'] + feature_names)
<class 'statsmodels.iolib.summary.Summary'>
"""
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 79
Model: Logit Df Residuals: 75
Method: MLE Df Model: 3
Date: Tue, 02 Aug 2022 Pseudo R-squ.: 0.4720
Time: 14:16:48 Log-Likelihood: -27.697
converged: False LL-Null: -52.451
Covariance Type: nonrobust LLR p-value: 1.016e-10
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 4.3045 1.637 2.629 0.009 1.095 7.514
NV 19.5002 5458.415 0.004 0.997 -1.07e+04 1.07e+04
PI -0.0422 0.044 -0.952 0.341 -0.129 0.045
EH -2.9026 0.846 -3.433 0.001 -4.560 -1.245
==============================================================================
Possibly complete quasi-separation: A fraction 0.16 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
"""
The reported finite estimate for NV is actually infinite.
In Python, we can use firthlogist to perform Firth logistic regression:
>>> from firthlogist import FirthLogisticRegression, load_endometrial
>>> X, y, feature_names = load_endometrial()
>>> fl = FirthLogisticRegression()
>>> fl.fit(X, y)
FirthLogisticRegression()
>>> fl.summary(xname=feature_names)
coef std err [0.025 0.975] p-value
--------- ---------- --------- --------- ---------- -----------
NV 2.92927 1.55076 0.609724 7.85463 0.00905243
PI -0.0347517 0.0395781 -0.124459 0.0404555 0.376022
EH -2.60416 0.776017 -4.36518 -1.23272 2.50418e-05
Intercept 3.77456 1.48869 1.08254 7.20928 0.00416242
Log-Likelihood: -24.0373
Newton-Raphson iterations: 8
By default, firthlogist uses penalized likelihood ratio tests and profile penalized likelihood confidence intervals, which are almost always preferable to Wald tests and confidence intervals, at the cost of being more computationally intensive. To use Wald tests and confidence intervals instead:
>>> fl = FirthLogisticRegression(wald=True)
>>> fl.fit(X, y)
FirthLogisticRegression(wald=True)
>>> fl.summary(xname=feature_names)
coef std err [0.025 0.975] p-value
--------- ---------- --------- --------- -------- -----------
NV 2.92927 1.55076 -0.110168 5.96871 0.0589022
PI -0.0347517 0.0395781 -0.112323 0.04282 0.379915
EH -2.60416 0.776017 -4.12513 -1.0832 0.000791344
Intercept 3.77456 1.48869 0.856776 6.69234 0.0112291
Log-Likelihood: -24.0373
Newton-Raphson iterations: 8
In R, we can use brglm2:
> library(brglm2)
> fit <- glm(HG~NV+PI+EH, family = binomial(logit), data = brglm2::endometrial, method = "brglmFit")
> summary(fit)
Call:
glm(formula = HG ~ NV + PI + EH, family = binomial(logit), data = brglm2::endometrial,
method = "brglmFit")
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4740 -0.6706 -0.3411 0.3252 2.6123
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.77456 1.48869 2.535 0.011229 *
NV 2.92927 1.55076 1.889 0.058902 .
PI -0.03475 0.03958 -0.878 0.379914
EH -2.60416 0.77602 -3.356 0.000791 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 104.903 on 78 degrees of freedom
Residual deviance: 56.575 on 75 degrees of freedom
AIC: 64.575
Type of estimator: AS_mixed (mixed bias-reducing adjusted score equations)
Number of Fisher Scoring iterations: 6
or logistf:
> library(logistf)
> fit <- logistf(HG~NV+PI+EH, data=brglm2::endometrial)
> summary(fit)
logistf(formula = HG ~ NV + PI + EH, data = brglm2::endometrial)
Model fitted by Penalized ML
Coefficients:
coef se(coef) lower 0.95 upper 0.95 Chisq p method
(Intercept) 3.77455951 1.43900672 1.0825371 7.20928050 8.1980136 4.193628e-03 2
NV 2.92927330 1.46497415 0.6097244 7.85463171 6.7984572 9.123668e-03 2
PI -0.03475175 0.03789237 -0.1244587 0.04045547 0.7468285 3.874822e-01 2
EH -2.60416387 0.75362838 -4.3651832 -1.23272106 17.7593175 2.506867e-05 2
Method: 1-Wald, 2-Profile penalized log-likelihood, 3-None
Likelihood ratio test=43.65582 on 3 df, p=1.78586e-09, n=79
Wald test = 21.66965 on 3 df, p = 7.641345e-05
This can happen to beginners who perform an incorrect operation on the target and end up with the same value in every row of the target variable.
For example, if you end up with all of the target "Direction" values as 0's, this error shows up.
Hopefully this helps someone who is a beginner like me!
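As an illustration (a hypothetical sketch; the DataFrame and the exact recoding mistake are made up, but mirror the "Direction" column mentioned above):

import pandas as pd

# Hypothetical data: 'Direction' takes the string values 'Up' and 'Down'.
df = pd.DataFrame({'Direction': ['Up', 'Down', 'Up', 'Down', 'Up']})

# Incorrect recoding: comparing against 'UP' (wrong case) never matches,
# so every value of the new target ends up as 0.
df['target'] = (df['Direction'] == 'UP').astype(int)
print(df['target'].value_counts())  # only 0's, so the model cannot be fit properly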
Make sure that you are not including the target along with the predictors. I accidentally included the target along with the predictors and struggled with this for a long time, for such a silly mistake.
Explanation: since the target that you included along with the predictors is perfectly correlated with itself, it gives you a "Perfect separation detected" error.
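A minimal sketch of the fix (the train DataFrame and the column name 'target' are hypothetical):

import statsmodels.api as sm

# Hypothetical training DataFrame 'train' with the outcome in column 'target'.
Y_train = train['target']
X_train = train.drop(columns=['target'])  # make sure the target is NOT among the predictors
X_train = sm.add_constant(X_train)

result = sm.GLM(Y_train, X_train, family=sm.families.Binomial()).fit()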
In logistic regression, whenever perfect separation error occurs