I'm using Auto-Sklearn and have a dataset with 42 heavily imbalanced classes. What is the best way to handle this imbalance?

As far as I know, there are two approaches to handling imbalanced data in machine learning: either use a resampling mechanism such as over- or under-sampling (or a combination of both), or solve it on the algorithmic level by choosing an inductive bias, which would require in-depth knowledge of the algorithms used within Auto-Sklearn. I'm not quite sure how to handle this problem. Is it possible to address the imbalance directly within Auto-Sklearn, or do I need to use resampling strategies as offered by e.g. imbalanced-learn?

Also, which evaluation metric should be used after the models have been computed? The roc_auc_score for multiple classes has been available since sklearn==0.22.1; however, Auto-Sklearn only supports sklearn up to version 0.21.3. Thanks in advance!
3 Answers
Another method is to set class weights according to class size. The effort involved is minimal, and it seems to work well. I was looking for a way to set weights in Auto-Sklearn, and this is what I found:
https://github.com/automl/auto-sklearn/issues/113
For example, scikit-learn's SVM has a `class_weight` parameter:
https://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html
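A minimal sketch of what that looks like in scikit-learn (the explicit weights below are illustrative, not tuned):

```python
from sklearn.svm import SVC

# 'balanced' reweights each class inversely proportional to its
# frequency in the training data.
clf = SVC(kernel="linear", class_weight="balanced")

# Alternatively, set explicit per-class weights, e.g. upweighting a
# rare class labelled 3; classes not listed keep weight 1.
clf = SVC(kernel="linear", class_weight={3: 10.0})
```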
I hope this helps :)

One way that has worked for me in the past to handle highly imbalanced datasets is the Synthetic Minority Oversampling Technique (SMOTE). Here is the paper for better understanding: Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357.
This works by synthetically oversampling the minority class (or classes, as the case may be). To quote the paper:
The minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining any/all of the k minority class nearest neighbors. Depending upon the amount of over-sampling required, neighbors from the k nearest neighbors are randomly chosen.
This moves your dataset closer to being balanced. There is an implementation of SMOTE in the imbalanced-learn (imblearn) package in Python.
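For example, a minimal sketch with imblearn on a synthetic imbalanced dataset (all numbers here are placeholders):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Build a toy 3-class dataset with roughly 90/7/3 class proportions.
X, y = make_classification(n_samples=5000, n_classes=3, n_informative=6,
                           weights=[0.9, 0.07, 0.03], random_state=42)
print("before:", Counter(y))

# k_neighbors must be smaller than the size of the rarest class.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print("after: ", Counter(y_res))
```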
Here is a good read about different oversampling algorithms. It includes oversampling using ADASYN as well as SMOTE.
I hope this helps.

Thanks a lot, this was really helpful! Did you also use SMOTE before training a classifier with Auto-Sklearn, or did you use another ML pipeline? And do you know by any chance which metric could be used other than the roc_auc_score? All the papers I have looked at so far considered the roc_auc_score, but unfortunately this can't currently be used within Auto-Sklearn. The number of different metrics and sampling strategies is overwhelming for a newbie. :D – MoDo Feb 20 '20 at 21:23
@MoDo Thank you :) I used SMOTE before training a classifier with a different ML pipeline. As for the metric, I'm not fully certain because I may have used other metrics that suited my use case at the time more than the `roc_auc_score`. I can't seem to recall unfortunately. I fully understand that the number of different metrics and sampling strategies is overwhelming. Once you start doing this every single day, it becomes a lot easier, just like everything else :) – Rahul P Feb 20 '20 at 21:40
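For reference, a common way to combine SMOTE with a downstream classifier is imblearn's Pipeline, which applies resampling during fit only; the classifier here is just a placeholder:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# The imblearn Pipeline applies the sampler when fitting, but not when
# predicting, so validation and test data remain untouched.
pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42)),
])
# pipe.fit(X_train, y_train), then pipe.predict(X_test)
```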
For those interested and as an addition to the answers given, I can highly recommend the following paper:
Lemnaru, C., & Potolea, R. (2011, June). Imbalanced classification problems: systematic study, issues and best practices. In International Conference on Enterprise Information Systems (pp. 35-50). Springer, Berlin, Heidelberg.
The authors argue that:
In terms of solutions, since the performance is not expected to improve significantly with a more sophisticated sampling strategy, more focus should be allocated to algorithm related improvements, rather than to data improvements.
Since, for example, the ChaLearn AutoML Challenge 2015 used balanced accuracy, sklearn describes it as a fitting metric for imbalanced data, and Auto-Sklearn was able to compute well-fitting models with it, I'm going to give it a try. Even without resampling, the results were much "better" (in terms of prediction quality) than when simply using accuracy.
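A sketch of what this looks like in code (the time budgets are placeholders; note that, depending on your Auto-Sklearn version, `metric` is passed either to the constructor or to `fit()`):

```python
import autosklearn.classification
import autosklearn.metrics

# Optimize for balanced accuracy instead of plain accuracy.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=600,
    per_run_time_limit=60,
    metric=autosklearn.metrics.balanced_accuracy,
)
# automl.fit(X_train, y_train), then automl.predict(X_test)
```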
