
I have to evaluate a logistic regression model. The model is meant to detect fraud, so in real life the algorithm will face highly imbalanced data.

Some people say that I need to balance the training set only, while the test set should remain similar to real-life data. On the other hand, many people say the model must be trained and tested on balanced samples.

I tried testing my model on both (balanced and unbalanced) sets and got the same ROC AUC (0.73), but different precision-recall curve AUCs: 0.4 (unbalanced) and 0.74 (balanced).

What should I choose?

And what metrics should I use to evaluate my model's performance?

  • NEVER EVER DO THAT. Always have a test set which best represent your deployment scenarios in real life and a corresponding metric. – Abhishek Prajapat Aug 12 '21 at 20:37
  • I’m voting to close this question because it is not about programming as defined in the [help] but about ML theory and/or methodology. – desertnaut Aug 12 '21 at 23:54
  • https://stackoverflow.com/questions/48805063/balance-classes-in-cross-validation , https://datascience.stackexchange.com/questions/82073/why-you-shouldnt-upsample-before-cross-validation – desertnaut Aug 12 '21 at 23:56

1 Answer


Since you are dealing with a problem that has an unbalanced concept (a disproportionately greater amount of not-fraud over fraud), I recommend you use F-scoring on an unbalanced test set that "matches" the real-world distribution. This lets you compare models without having to balance your test set, since balancing it would mean over-representing fraud cases and under-representing non-fraud cases.

Here are some references, plus how to implement it in sklearn:

  • https://en.wikipedia.org/wiki/F-score
  • https://deepai.org/machine-learning-glossary-and-terms/f-score
  • https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9
  • https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
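As a minimal sketch (the labels and predictions below are made up for illustration, with 1 = fraud), the F-score on the positive class can be computed with `sklearn.metrics.f1_score`:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy imbalanced data: 95 legitimate transactions, 5 frauds
y_true = np.array([0] * 95 + [1] * 5)

# Hypothetical hard predictions from a model:
# 2 false positives among the negatives, 2 of 5 frauds caught
y_pred = np.array([0] * 93 + [1, 1] + [1, 1, 0, 0, 0])

# F1 on the positive (fraud) class:
# precision = 2/4 = 0.5, recall = 2/5 = 0.4, F1 = 2*0.5*0.4/0.9 ≈ 0.444
print(f1_score(y_true, y_pred))
```

`f1_score` defaults to scoring the positive class (`pos_label=1`), which is what you want for fraud detection; averaging over both classes (e.g. `average='macro'`) would let the dominant non-fraud class mask poor fraud performance.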

James Sato
    Many thanks :) But the F-score expects a probability threshold to be set. In my case, I want to evaluate my model across all possible thresholds; otherwise I would have to loop through 100 different thresholds, which I don't think is a good idea – mad_scientist Aug 12 '21 at 20:58
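For what it's worth, no manual loop over thresholds is needed: `sklearn.metrics.precision_recall_curve` sweeps every distinct predicted score as a threshold in one call, and the resulting curve can be summarized with `auc` or `average_precision_score`. A small sketch with made-up scores (chosen here so the positives rank perfectly above the negatives):

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Hypothetical true labels and predicted fraud probabilities
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 1])
y_scores = np.array([0.1, 0.2, 0.15, 0.3, 0.8, 0.4, 0.6, 0.35, 0.9, 0.7])

# One call evaluates precision/recall at every distinct score threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# Summarize the whole curve as a single threshold-free number
pr_auc = auc(recall, precision)
ap = average_precision_score(y_true, y_scores)
print(pr_auc, ap)
```

With perfectly separated scores like these, both summaries come out as 1.0; on real fraud data they give the threshold-free comparison the comment is asking for, on the unbalanced test set.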