I am fitting a logistic regression in R to build a bankruptcy prediction model. My data consist of financial ratios for many companies, each classified as "bad" (value 0) or "good" (value 1).
However, some of the predictor variables appear to separate the two classes perfectly, which results in the following warning:
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
To solve this issue I used a form of penalized regression, namely the brglm package in R.
This resulted in a model with five predictor variables (indicated as X1 - X5):
final_brglm <- brglm(Good1_Bad0 ~ X1 + X2 + X3 + X4 + X5, data = train_data)
The model has a very high accuracy and is based on the following principle:
For the score "Y" (with coefficients B1 - B5)
Y <- intercept + B1*X1 + B2*X2 + B3*X3 + B4*X4 + B5*X5
and predicted probability "pred"
pred <- exp(Y) / (1 + exp(Y))
When Y > 0 (pred > 0.5) the company is classified as "good", and when Y < 0 (pred < 0.5) it is classified as "bad".
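A minimal sketch of these scoring steps is given here for reference. It assumes simulated data in place of train_data (the data frame `sim_data`, its two predictors, and the coefficients used to generate it are hypothetical), and it uses base R's glm() rather than brglm() so that it runs without extra packages:

```r
# Simulated stand-in for train_data (hypothetical values, not real ratios)
set.seed(42)
n <- 200
sim_data <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
sim_data$Good1_Bad0 <- rbinom(
  n, 1, plogis(0.5 + 1.2 * sim_data$X1 - 0.8 * sim_data$X2)
)

# Plain logistic regression on the simulated data
fit <- glm(Good1_Bad0 ~ X1 + X2, family = binomial, data = sim_data)

# Score Y: intercept plus coefficient-weighted predictors
B <- coef(fit)
Y <- B["(Intercept)"] + B["X1"] * sim_data$X1 + B["X2"] * sim_data$X2

# Predicted probability; plogis(Y) computes exp(Y) / (1 + exp(Y))
pred <- plogis(Y)

# Classify at Y > 0, which is the same cut as pred > 0.5
predicted_class <- ifelse(Y > 0, "good", "bad")
```

The hand-computed Y should match predict(fit, type = "link"), and pred should match predict(fit, type = "response").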
However, the resulting predicted probabilities are all either very close to 1 or very close to 0, because Y is either very large (max Y = 13389261) or very negative (min Y = -4719827). There is almost nothing in between, which makes it difficult to build a score around the model that expresses the probability of default/bankruptcy.
This is also indicated by the plot of predicted probability against the Y score.
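To put the size of these scores in perspective, here is a small sketch using only the two extreme Y values quoted above. The helper `naive` is a hypothetical name for the hand-written formula. It shows that the formula overflows at the largest score, whereas base R's plogis() stays stable, and it shows how narrow the range of Y is that gives intermediate probabilities:

```r
Y_max <- 13389261
Y_min <- -4719827

# Hand-written logistic formula (hypothetical helper name)
naive <- function(y) exp(y) / (1 + exp(y))

# exp(Y_max) overflows to Inf, so Inf / Inf gives NaN
naive(Y_max)
naive(Y_min)

# plogis() is the numerically stable logistic function in base R
plogis(Y_max)
plogis(Y_min)

# Range of Y that maps to probabilities between 1% and 99%
qlogis(c(0.01, 0.99))
```

Scores only need to lie roughly between -4.6 and 4.6 to cover probabilities from 1% to 99%, so values in the millions are many orders of magnitude outside that range.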
I am relatively new to R and I don't know what to do with this. Does it mean the separation problem is not yet solved? I also read something about normalizing the variables, which I have not done since all predictors are financial ratios (e.g. sales / assets).
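For concreteness, normalizing the predictors could be sketched as below. This is an illustration under assumptions, not my actual pipeline: the data are simulated, the names `sim`, `std`, `fit_raw` and `fit_std` are hypothetical, and a plain glm() is used instead of brglm() so that only base R is needed. In this ordinary maximum-likelihood setting, standardizing changes the scale of the coefficients but not the fitted probabilities. Whether the same holds for the bias-reduced fit is not checked here.

```r
# Simulated ratios (hypothetical values, no perfect separation)
set.seed(1)
n <- 200
sim <- data.frame(X1 = rnorm(n, mean = 1, sd = 0.5),
                  X2 = rnorm(n, mean = 0.3, sd = 0.2))
sim$Good1_Bad0 <- rbinom(n, 1, plogis(-1 + 1 * sim$X1 + 2 * sim$X2))

# Standardize each predictor to mean 0 and standard deviation 1
std <- as.data.frame(scale(sim[, c("X1", "X2")]))
std$Good1_Bad0 <- sim$Good1_Bad0

# Same model on raw and on standardized predictors
fit_raw <- glm(Good1_Bad0 ~ X1 + X2, family = binomial, data = sim)
fit_std <- glm(Good1_Bad0 ~ X1 + X2, family = binomial, data = std)

# Coefficients differ in scale; fitted probabilities are compared here
coef(fit_raw)
coef(fit_std)
all.equal(unname(fitted(fit_raw)), unname(fitted(fit_std)), tolerance = 1e-6)
```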