I'm working with an extremely class-imbalanced data set (positives make up ~0.1% of examples) and have explored a number of sampling techniques to improve model performance, measured by AUPRC. Since I only have a few thousand positive examples and several million negative examples, I have mostly explored downsampling the negatives. In general, this approach has produced almost no discernible improvement when the model is evaluated on an imbalanced test set that reflects the true class distribution.
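For concreteness, here is a minimal sketch of the setup I'm describing, assuming a scikit-learn/XGBoost workflow. The synthetic data, the 10:1 downsampling ratio, and the hyperparameters are stand-ins, not my actual pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score
from xgboost import XGBClassifier

# Synthetic stand-in for the real data: ~0.1% positives.
X, y = make_classification(
    n_samples=200_000, n_features=20, weights=[0.999], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Downsample only the training negatives (here, 10 negatives per positive);
# the test set keeps the true class ratio.
rng = np.random.default_rng(0)
pos = np.flatnonzero(y_train == 1)
neg = np.flatnonzero(y_train == 0)
keep = np.concatenate([pos, rng.choice(neg, size=10 * len(pos), replace=False)])

model = XGBClassifier(n_estimators=300, eval_metric="aucpr")
model.fit(X_train[keep], y_train[keep])

# AUPRC on the untouched, imbalanced test set.
scores = model.predict_proba(X_test)[:, 1]
print("AUPRC (imbalanced test):", average_precision_score(y_test, scores))
```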
However, as an experiment I tried downsampling both the training and test sets, and found an order-of-magnitude (10x) increase in performance. This finding held for both XGBoost and a simple fully connected MLP.
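The experiment, continuing the sketch above (same model and variables; the 10:1 ratio is again just an illustrative choice), only changes the test set that gets scored:

```python
# Additionally downsample the test negatives at the same 10:1 ratio,
# then re-score the same model on that reduced test set.
test_pos = np.flatnonzero(y_test == 1)
test_neg = np.flatnonzero(y_test == 0)
test_keep = np.concatenate(
    [test_pos, rng.choice(test_neg, size=10 * len(test_pos), replace=False)]
)

scores_ds = model.predict_proba(X_test[test_keep])[:, 1]
print("AUPRC (downsampled test):",
      average_precision_score(y_test[test_keep], scores_ds))
```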
This suggests to me that the model can in fact distinguish the classes, but I can't figure out how to adjust a model trained on the more balanced (downsampled) training set so that it shows a similar performance gain when evaluated on the imbalanced test set. Any suggestions?