
I am implementing logistic regression to create a bankruptcy prediction model in R. My data consists of financial ratios for many companies, which I classified as "bad" (value 0) or "good" (value 1).

However, some of the predictor variables seemed to separate the two classes perfectly, resulting in the following warning message:

Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred 
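
A rough check for which predictors are behind this (a sketch using the column names from the model below; if the per-class ranges of a predictor barely overlap, it separates "good" from "bad" almost perfectly):

# compare the range of each predictor within the two classes
for (v in c("X1", "X2", "X3", "X4", "X5")) {
  cat(v, ":\n")
  print(tapply(train_data[[v]], train_data$Good1_Bad0, range))
}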

To solve this issue I used a form of penalized regression, namely bias-reduced logistic regression via the brglm package in R.

This resulted in a model with five predictor variables (indicated as X1 - X5):

library(brglm)  # bias-reduced (Firth-type) logistic regression

final_brglm <- brglm(Good1_Bad0 ~ X1 + X2 + X3 + X4 + X5, data = train_data)
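
To check whether the penalization actually reined in the estimates, the coefficient magnitudes can be inspected (a sketch; brglm objects support the usual glm accessors):

summary(final_brglm)  # very large estimates/standard errors hint at remaining separation
coef(final_brglm)     # coefficients on the original ratio scale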

The model has very high accuracy and is based on the following principle:

For the score "Y" (with coefficients B1 - B5):

Y <- intercept + B1*X1 + B2*X2 + B3*X3 + B4*X4 + B5*X5

and predicted probability "pred":

pred <- exp(Y) / (1 + exp(Y))

When Y > 0 (pred > 0.5) the company is classified as "good", and when Y < 0 (pred < 0.5) as "bad".
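
Both quantities can be read off the fitted model (a sketch; since brglm objects inherit from glm, predict() should work the usual way):

# Y is the linear predictor ("link" scale), pred the fitted probability
Y    <- predict(final_brglm, newdata = train_data, type = "link")
pred <- predict(final_brglm, newdata = train_data, type = "response")

# sanity check: the two are tied together by the logistic transform
all.equal(unname(pred), unname(exp(Y) / (1 + exp(Y))))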

However, the resulting predicted probabilities are either very close to 1 or very close to 0; Y is either very large (max Y = 13389261) or very small (min Y = -4719827). There is not much in between, which makes it difficult to build a score around the model to predict the probability of default/bankruptcy.
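
The saturation is easy to see by summarising the fitted probabilities (assuming pred from the sketch above):

summary(pred)  # nearly everything piles up at ~0 or ~1
hist(pred, breaks = 50, xlab = "pred", main = "Fitted probabilities")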

This is also indicated by the plot of predicted probability against the Y score.

I am relatively new to R and I don't know what to do with this. Does it mean the separation problem is not yet solved? I also read about normalizing the variables, which I have not done since all predictors are financial ratios (e.g. sales / assets).
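
If normalizing is worth trying, I assume standardizing the ratio columns would look something like this (a sketch; pred_cols and train_scaled are names I made up):

# standardize the predictor columns to mean 0, sd 1
pred_cols <- c("X1", "X2", "X3", "X4", "X5")
train_scaled <- train_data
train_scaled[pred_cols] <- scale(train_scaled[pred_cols])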

  • This isn't a programming question, it's a statistics question. It should be on stats.stackexchange. – Gregor Thomas Jan 17 '18 at 14:26
  • Yes, you should migrate it to stats.stackexchange.com. Anyway, since you do the logistic transform on score Y to get `pred <- exp(Y)/(1+exp(Y))`, that's going to squash you towards one of two extreme values. – smci Jan 17 '18 at 14:36
  • Also voted to move to the stats site. But you should add further details. How big is your data, how many bankrupt firms (as this is often rare in datasets), are all the predictors financial ratios? (ps you could look at `logistf` package) – user20650 Jan 17 '18 at 14:37
  • more generally, perhaps of interest https://cran.r-project.org/web/packages/bgeva/index.html . – user20650 Jan 17 '18 at 14:39
  • I'd encourage you to think about things (and how what you've seen makes sense based on your data). You have features that perfectly predict the result - which doesn't work in the vanilla GLM framework, so you penalized the parameters which shrinks them a little toward 0. Basically, the "perfect" parameter estimates are +/- infinity, and you apply a penalty to make them finite (but they'll still be very very large), so when you exponentiate them to get odds ratios you get *almost* 0 and *almost* infinity instead of exactly 0 and infinity. (Using "almost infinity" rather loosely, of course). – Gregor Thomas Jan 17 '18 at 14:55
  • Also think about your goal and your data. If you really have a few variables that (at least historically) perfectly predict bankruptcy, and you would really know them in advance, then penalizing them is doing yourself a disservice. Your model should consist of some rules (`if` any of these perfectly predictive variables show bankruptcy, `then` predict bankruptcy), and use logistic regression for whatever's left. You might also consider a tree-based model like random forest that will use this information effectively. – Gregor Thomas Jan 17 '18 at 14:59
  • (@Gregor: that's just a decision-tree, combined with LR). If you really have a few variables that (at least historically) perfectly predict bankruptcy, then use a tree-based method like Random Forest already. – smci Jan 17 '18 at 15:32
  • @smci exactly my point. – Gregor Thomas Jan 17 '18 at 15:55
  • @Gregor right but I'm saying no need to manually construct a decision-tree from if-else statements. Just use DT/Random Forest with e.g. max_depth constraint. Then use LR on records which cannot be perfectly classified. (I guess we implement that as three-class classification: 0, 1 and 2:'USE_LR') – smci Jan 17 '18 at 16:20
  • The likely cause, given that data plot, is known as quasi-separation. Do a search; you'll probably find it on SO and certainly on SE. – IRTFM Jan 17 '18 at 22:37

0 Answers