
This is the head of my training data set (screenshot of the X_train head not shown).

Running the below code:

import statsmodels.api as sm

logit = sm.GLM(Y_train, X_train, family=sm.families.Binomial())
result = logit.fit()

Can you please help?

Getting the below error (error screenshot not shown): "Perfect separation detected".

codeLover
Dipannita Banerjee

5 Answers


statsmodels has detected complete or quasi-complete separation between one or more of your predictors and the outcome variable.

This happens when all or nearly all of the values in one of the predictor categories (or a combination of predictors) are associated with only one of the binary outcome values. (I'm assuming you're attempting a logistic regression.) When this happens, a finite estimate cannot be found for that predictor's coefficient.

There are several possible solutions. Depending on how many variables are in your analysis, you can run two-way crosstabs between your outcome and each of the predictor variables to locate any cells with zero observations, then either drop that variable from the analysis or collapse it into fewer categories (a sketch of this check follows below). Another option is to run a Firth logistic regression or a penalized regression.
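
For example, a minimal sketch of the crosstab check, assuming the training data sit in a pandas DataFrame (the column names here are made up for illustration):

import pandas as pd

# Hypothetical example; replace with your own X_train / Y_train columns.
df = pd.DataFrame({
    "y":      [0, 0, 1, 1, 0, 1, 0, 1],
    "region": ["A", "B", "A", "B", "A", "B", "B", "A"],
    "smoker": ["no", "no", "yes", "yes", "no", "yes", "no", "yes"],
})

# Cross-tabulate each categorical predictor against the outcome.
# A zero count in any cell means that category is only ever seen with one
# outcome value, i.e. a potential source of (quasi-)complete separation.
for col in ["region", "smoker"]:
    print(pd.crosstab(df[col], df["y"]), end="\n\n")

Here every smoker has y = 1 and every non-smoker has y = 0, so the smoker column separates the outcome perfectly and would need to be dropped, recoded, or handled with a penalized/Firth fit.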

RobertF

Following up on RobertF's nice answer with some more info on running Firth logistic regression.

Firth logistic regression is an effective way to deal with separation, which is present in your dataset, as RobertF explained. See Heinze and Schemper (2002) for more detail, or see this Medium article for a more casual explanation.

There are easy-to-use packages available in Python (firthlogist) and R (brglm2 and logistf). To demonstrate, I'll use the endometrial cancer dataset that was analyzed in the Heinze and Schemper 2002 paper. In short, there are three features NV (binary), PI (continuous), and EH (continuous), and target HG (binary). There is no observation where NV=1 and HG=0, which results in quasi-complete separation.

Ioannis Kosmidis, the developer of brglm2, also created a package called detectseparation. The below code snippet is taken from the vignette, showing that there is separation in the data and that the maximum likelihood estimate for NV is infinity.

library(detectseparation)               # provides the "detect_separation" fitting method
data("endometrial", package = "brglm2") # the endometrial cancer data used throughout

endo_sep <- glm(HG ~ NV + PI + EH, data = endometrial,
                family = binomial("logit"),
                method = "detect_separation")
endo_sep
#> Implementation: ROI | Solver: lpsolve 
#> Separation: TRUE 
#> Existence of maximum likelihood estimates
#> (Intercept)          NV          PI          EH 
#>           0         Inf           0           0 
#> 0: finite value, Inf: infinity, -Inf: -infinity

Let's first try without penalization:

>>> from firthlogist import load_endometrial
>>> import statsmodels.api as sm
>>> X, y, feature_names = load_endometrial()
>>> log_reg = sm.Logit(y, X).fit()
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.350590
         Iterations: 35
~/miniconda3/envs/firth/lib/python3.10/site-packages/statsmodels/base/model.py:604: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
>>> log_reg.summary(xname=['Intercept'] + feature_names)
<class 'statsmodels.iolib.summary.Summary'>
"""
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                   79
Model:                          Logit   Df Residuals:                       75
Method:                           MLE   Df Model:                            3
Date:                Tue, 02 Aug 2022   Pseudo R-squ.:                  0.4720
Time:                        14:16:48   Log-Likelihood:                -27.697
converged:                      False   LL-Null:                       -52.451
Covariance Type:            nonrobust   LLR p-value:                 1.016e-10
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      4.3045      1.637      2.629      0.009       1.095       7.514
NV            19.5002   5458.415      0.004      0.997   -1.07e+04    1.07e+04
PI            -0.0422      0.044     -0.952      0.341      -0.129       0.045
EH            -2.9026      0.846     -3.433      0.001      -4.560      -1.245
==============================================================================

Possibly complete quasi-separation: A fraction 0.16 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
"""

The reported estimate for NV looks finite, but the huge standard error gives it away: the maximum likelihood estimate is actually infinite.

In Python, we can use firthlogist to perform Firth logistic regression:

>>> from firthlogist import FirthLogisticRegression, load_endometrial
>>> X, y, feature_names = load_endometrial()
>>> fl = FirthLogisticRegression()
>>> fl.fit(X, y)
FirthLogisticRegression()
>>> fl.summary(xname=feature_names)
                 coef    std err     [0.025      0.975]      p-value
---------  ----------  ---------  ---------  ----------  -----------
NV          2.92927    1.55076     0.609724   7.85463    0.00905243
PI         -0.0347517  0.0395781  -0.124459   0.0404555  0.376022
EH         -2.60416    0.776017   -4.36518   -1.23272    2.50418e-05
Intercept   3.77456    1.48869     1.08254    7.20928    0.00416242

Log-Likelihood: -24.0373
Newton-Raphson iterations: 8

By default, firthlogist uses penalized likelihood ratio tests and profile penalized likelihood confidence intervals, which are almost always preferable to Wald tests and confidence intervals, at the cost of being more computationally intensive. To use Wald instead:

>>> fl = FirthLogisticRegression(wald=True)
>>> fl.fit(X, y)
FirthLogisticRegression(wald=True)
>>> fl.summary(xname=feature_names)
                 coef    std err     [0.025    0.975]      p-value
---------  ----------  ---------  ---------  --------  -----------
NV          2.92927    1.55076    -0.110168   5.96871  0.0589022
PI         -0.0347517  0.0395781  -0.112323   0.04282  0.379915
EH         -2.60416    0.776017   -4.12513   -1.0832   0.000791344
Intercept   3.77456    1.48869     0.856776   6.69234  0.0112291

Log-Likelihood: -24.0373
Newton-Raphson iterations: 8

In R, we can use brglm2:

> library(brglm2)
> fit <- glm(HG~NV+PI+EH, family = binomial(logit), data = brglm2::endometrial, method = "brglmFit")
> summary(fit)

Call:
glm(formula = HG ~ NV + PI + EH, family = binomial(logit), data = brglm2::endometrial, 
    method = "brglmFit")

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4740  -0.6706  -0.3411   0.3252   2.6123  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.77456    1.48869   2.535 0.011229 *  
NV           2.92927    1.55076   1.889 0.058902 .  
PI          -0.03475    0.03958  -0.878 0.379914    
EH          -2.60416    0.77602  -3.356 0.000791 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 104.903  on 78  degrees of freedom
Residual deviance:  56.575  on 75  degrees of freedom
AIC:  64.575

Type of estimator: AS_mixed (mixed bias-reducing adjusted score equations)
Number of Fisher Scoring iterations: 6

or logistf:

> library(logistf)
> fit <- logistf(HG~NV+PI+EH, data=brglm2::endometrial)
> summary(fit)
logistf(formula = HG ~ NV + PI + EH, data = brglm2::endometrial)

Model fitted by Penalized ML
Coefficients:
                   coef   se(coef) lower 0.95  upper 0.95      Chisq            p method
(Intercept)  3.77455951 1.43900672  1.0825371  7.20928050  8.1980136 4.193628e-03      2
NV           2.92927330 1.46497415  0.6097244  7.85463171  6.7984572 9.123668e-03      2
PI          -0.03475175 0.03789237 -0.1244587  0.04045547  0.7468285 3.874822e-01      2
EH          -2.60416387 0.75362838 -4.3651832 -1.23272106 17.7593175 2.506867e-05      2

Method: 1-Wald, 2-Profile penalized log-likelihood, 3-None

Likelihood ratio test=43.65582 on 3 df, p=1.78586e-09, n=79
Wald test = 21.66965 on 3 df, p = 7.641345e-05
jon

Beginners can also run into this error after performing an incorrect operation on the target that ends up giving every row the same target value.

e.g.:

  • If your target is "Up" or "Down" and you want to convert it to binary values, but you mistakenly compare against "UP" (note the case of the "P") as shown below:
  • df["Direction"] = pd.Series(np.where(df.Direction.values == "UP", 1, 0), df.index)

you will end up with every "Direction" value equal to 0, and this error shows up (a quick sanity check is sketched below).
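
A minimal sketch of that sanity check, using a hypothetical df with a Direction column:

import numpy as np
import pandas as pd

# Hypothetical data frame with the raw "Up"/"Down" target.
df = pd.DataFrame({"Direction": ["Up", "Down", "Up", "Up", "Down"]})

# Buggy mapping: comparing against "UP" never matches, so every row becomes 0.
df["bad"] = pd.Series(np.where(df.Direction.values == "UP", 1, 0), df.index)

# Corrected mapping, made case-insensitive to be safe.
df["good"] = pd.Series(np.where(df.Direction.str.upper() == "UP", 1, 0), df.index)

# Always confirm the encoded target still has both classes before fitting.
print(df["bad"].value_counts())   # only 0s: the model cannot be fit
print(df["good"].value_counts())  # both 0s and 1s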


This may help someone who is a noob like me!

Make sure that you are not including the target among the predictors. I accidentally included the target along with the predictors and struggled with this for a long time over such a silly mistake.

Explanation: since the target you included along with the predictors is perfectly correlated with itself, it predicts the outcome perfectly and you get a "Perfect separation detected" error (see the sketch below).
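
A minimal sketch of the mistake and the fix, assuming a hypothetical DataFrame df whose target column is y:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: two predictors and a binary target y.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = (df["x1"] + rng.normal(size=100) > 0).astype(int)

# Buggy: the target slips into the design matrix, so it "predicts" itself
# perfectly and the fit fails with the perfect separation error.
# X_bad = sm.add_constant(df[["x1", "x2", "y"]])
# sm.Logit(df["y"], X_bad).fit()

# Fixed: build the predictors by explicitly dropping the target column.
X = sm.add_constant(df.drop(columns="y"))
result = sm.Logit(df["y"], X).fit()
print(result.summary())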

Benison Sam

In logistic regression, whenever the perfect separation error occurs:

  1. Compute the correlation of the target variable with each predictor and build a heat map (a sketch of this step follows below).
  2. Look at their collinearity with respect to the target variable; take the predictor with the lowest collinearity and drop that column from the data frame.
  3. Rebuild the model.
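
A minimal sketch of the heat-map step, assuming a hypothetical pandas DataFrame df whose column y is the binary target, using seaborn:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data frame: two noisy predictors plus the binary target y.
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = (df["x1"] + rng.normal(size=200) > 0).astype(int)

# Correlation matrix of predictors and target, drawn as a heat map.
corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlations between predictors and target")
plt.show()

# Inspect how strongly each predictor is associated with y before deciding
# which column to drop or recode.
print(corr["y"].drop("y").sort_values())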
betelgeuse