
Using scikit-learn on balanced training data of around 50 million samples (50% one class, 50% the other, 8 continuous features in the interval (0,1)), all classifiers that I have been able to try so far (LinearRegression/LogisticRegression, LinearSVC, RandomForestClassifier, ...) show a strange behavior:

When testing on the training data, the false-positive rate (FPR) is much lower than the false-negative rate (FNR). When I correct the intercept manually in order to increase the FPR, the accuracy actually improves considerably.

Why do the classification algorithms not find a close-to-optimal intercept (which I guess would be more or less where FPR = FNR)? A rough sketch of what I mean by "correcting the intercept manually" is below.
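For concreteness, this is roughly what I do (a sketch only; clf, X_train and y_train stand in for my fitted LogisticRegression and training data):

    import numpy as np
    from sklearn.metrics import accuracy_score, confusion_matrix

    original_intercept = clf.intercept_.copy()
    for shift in np.linspace(-2.0, 2.0, 9):
        clf.intercept_ = original_intercept + shift   # move the decision boundary
        y_pred = clf.predict(X_train)
        tn, fp, fn, tp = confusion_matrix(y_train, y_pred).ravel()
        # print shift, FPR, FNR and accuracy on the training data
        print(shift, fp / (fp + tn), fn / (fn + tp), accuracy_score(y_train, y_pred))
    clf.intercept_ = original_intercept               # restore the fitted intercept

For several of these shifts the accuracy is clearly higher than with the intercept the solver found.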

DataJanitor
Radio Controlled
  • I'm having a similar problem, with not enough fnr or tnr. – Moondra Apr 21 '17 at 15:11
  • Logistic regression optimizes the log loss. The log loss might be minimized in a situation with high false negatives, or in a situation with high false positives, or in a situation with low accuracy, it just depends on what your data are like. If you want to optimize accuracy, then optimize accuracy instead of log loss. – Him Apr 22 '23 at 04:08

1 Answer


I guess the idea is that there's no single definition of "optimal": for some applications you'll tolerate false positives much more readily than false negatives (e.g. detecting fraud or disease, where you don't want to miss a positive), whereas for other applications false positives are much worse (predicting equipment failures, crimes, or anything else where acting on a prediction is expensive). By default, predict just uses 0.5 as the threshold, which is usually not what you want. You need to think about your application, then look at the ROC curve and the gains/lift charts to decide where to set the prediction threshold.
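As a minimal sketch of picking a threshold from the ROC curve instead of relying on the default 0.5 (assuming clf is your already-fitted classifier with predict_proba, and X_val/y_val are the data you evaluate on):

    import numpy as np
    from sklearn.metrics import roc_curve

    # probability scores for the positive class
    scores = clf.predict_proba(X_val)[:, 1]

    # one (FPR, TPR) point per candidate threshold
    fpr, tpr, thresholds = roc_curve(y_val, scores)

    # example criterion: threshold where FPR is closest to FNR (FNR = 1 - TPR)
    idx = np.argmin(np.abs(fpr - (1 - tpr)))
    threshold = thresholds[idx]

    # predict with the chosen threshold instead of clf.predict (which uses 0.5)
    y_pred = (scores >= threshold).astype(int)

Here the criterion balances FPR against FNR, but you would replace that one line with whatever trade-off your application actually calls for.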

maxymoo