Using scikit-learn on balanced training data of around 50 million samples (50% one class, 50% the other; 8 continuous features in the interval (0, 1)), all classifiers that I have been able to try so far (Linear/LogisticRegression, LinearSVC, RandomForestClassifier, ...) show a strange behavior:
When testing on the training data, the false-positive rate (FPR) is much lower than the false-negative rate (FNR). When I correct the intercept manually in order to increase the FPR, the accuracy actually improves considerably.
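To illustrate what I mean by "correcting the intercept manually", here is a minimal sketch on synthetic stand-in data (the real set is much larger and cannot be shared; the data-generating process and the shift grid here are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the real data: 8 continuous features in (0, 1),
# roughly balanced binary labels.
rng = np.random.default_rng(0)
n = 10_000
X = rng.random((n, 8))
y = (X[:, 0] + 0.5 * rng.standard_normal(n) > 0.5).astype(int)

clf = LogisticRegression().fit(X, y)

# Shift the fitted intercept over a grid and keep the shift that maximizes
# training accuracy; the grid includes 0, i.e. the unshifted model.
best_acc, best_shift = 0.0, 0.0
for shift in np.linspace(-1.0, 1.0, 41):
    clf.intercept_ += shift
    acc = clf.score(X, y)
    clf.intercept_ -= shift  # restore before trying the next shift
    if acc > best_acc:
        best_acc, best_shift = acc, shift
```

In my case, the best shift found this way is clearly nonzero and the accuracy gain is substantial.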
Why do the classification algorithms not find a close-to-optimal intercept (which I would guess lies roughly where FPR = FNR)?