
Background - The dataset I am working on is highly imbalanced, with 543 classes. The data is bounded by date. After exploring the data over a span of 5 years, I found that the imbalance is inherent and persistent. The test data the model will receive will also be bounded by a date range and will have a similar imbalance.

The imbalance stems from differences in spend and in product popularity. Correcting the imbalance would therefore misrepresent the business.

Questions - In such a case, is it okay to proceed with building a model on the imbalanced data?

The model would be retrained every month on the new data and used for predictions once a month.

learnToCode
  • It also depends on the percentage of imbalance. A simple solution would be to sample separately from the underrepresented and the fully represented classes such that on each training run you have a fixed random percentage of the underrepresented present. Other bootstrapping methods might come in handy here too. Best – smile Jul 19 '20 at 13:29
  • The imbalance is huge. One class has around 20k observations and another has around 50, and this is the case with almost all the classes in the target variable. Correct me if I am wrong: you mean to pick a fixed number of observations from the minority and the same number of observations from the majority, then create a training dataset from the two samples? Also, wouldn't applying bootstrapping in this case be the same as random oversampling? – learnToCode Jul 22 '20 at 05:33
  • With this amount of imbalance one would be better off using models such as LightGBM or XGBoost, or an ensemble model; deep learning is data intensive if you want to guarantee even a good performance. The target would be to have the underrepresented samples present in training as much as the fully represented data, hence why I suggested bootstrapping. The idea is to iteratively sample a fixed amount from the underrepresented classes, sample randomly from the fully represented ones, and train on this repeatedly (a rough sketch of this idea follows these comments). Best – smile Jul 22 '20 at 05:43
  • I’m voting to close this question because it is not about programming as defined in the [help] but about ML theory and/or methodology - please see the intro and NOTE in the `machine-learning` [tag info](https://stackoverflow.com/tags/machine-learning/info). – desertnaut Apr 19 '21 at 08:43
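
As a hedged sketch (not from the original thread) of the iterative balanced sampling suggested in the comments, the snippet below repeatedly draws the same number of rows from every class (with replacement) and trains one small tree per draw; the toy data and all names are hypothetical stand-ins for the 20k-vs-50 situation described in the question:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data: one class with 20,000 rows, another with 50,
# mimicking the imbalance described in the question.
y = np.concatenate([np.zeros(20_000, dtype=int), np.ones(50, dtype=int)])
X = rng.normal(loc=y[:, None], scale=3.0, size=(y.size, 4))

def balanced_bootstrap(X, y, per_class=50, rng=rng):
    """Draw the same number of rows from every class, with replacement."""
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=per_class, replace=True)
        for c in np.unique(y)
    ])
    return X[idx], y[idx]

# Train one small tree per balanced bootstrap draw ...
ensemble = []
for _ in range(25):
    X_b, y_b = balanced_bootstrap(X, y)
    ensemble.append(DecisionTreeClassifier(max_depth=3).fit(X_b, y_b))

# ... and combine them by majority vote at prediction time.
votes = np.stack([model.predict(X[:5]) for model in ensemble])
print(np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes))
```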

1 Answer


Depending on what you are trying to model, it may or may not be correct to do so.

Training on an imbalanced dataset will generally make your model overfit to the elements that appear more often, which leads to a bias towards them at best, or to no understanding of the underrepresented samples at worst. If you are trying to model the natural occurrence of some information, then an unbalanced dataset in essence already has a prior probability applied to it, so the resulting bias may be desired: in these cases, the number of elements per class is part of the actual information. Such a bias can also be modeled (or un-modeled) artificially, e.g. by applying a scaling factor during classification (for instance through class weights). To avoid such bias, boosting and ensemble methods such as XGBoost (or AdaBoost in more trivial cases), or plain random forests, work relatively well. If you have the time, k-fold cross-validation can help reduce the error further.
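
As a hedged illustration of the class-weight idea (not code from the original answer), one way to apply such a scaling factor with scikit-learn is to weight every sample inversely to its class frequency; the toy data below is made up:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)

# Made-up data: three classes with counts 2,000 / 200 / 20 as a small
# stand-in for the 543-class, 20k-vs-50 problem from the question.
counts = {0: 2_000, 1: 200, 2: 20}
y = np.concatenate([np.full(n, c) for c, n in counts.items()])
X = rng.normal(loc=y[:, None], scale=2.0, size=(y.size, 5))

# "balanced" gives every sample a weight inversely proportional to its
# class frequency, so rare classes contribute as much to the fit.
weights = compute_sample_weight(class_weight="balanced", y=y)

clf = GradientBoostingClassifier(n_estimators=50)
clf.fit(X, y, sample_weight=weights)
```

Gradient-boosting libraries accept the same kind of weighting, e.g. xgboost.XGBClassifier.fit and lightgbm.LGBMClassifier.fit both take a sample_weight argument, and many scikit-learn estimators accept class_weight="balanced" directly.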

To make sure every sample is adequately represented, you may choose to oversample the underrepresented classes or undersample the overrepresented ones. In order to determine correct likelihoods, make sure to capture the prior distribution as well and use it to shape the posterior. Data augmentation may help if the number of samples is low; depending on your case, synthetic data generation might be a good approach. You could try, say, training a GAN only on the underrepresented samples and using it to generate more. As an idea: train it on all available data first, then change the discriminator loss to force it to forge and recognize the underrepresented classes only. Without entering the deep learning domain, techniques such as SMOTE or ADASYN may work; both are available in the imblearn Python package, which builds on scikit-learn.
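
To avoid inflating every class up to the majority count, the resampling targets can be set per class; below is a minimal sketch with imblearn, where the toy data and the target counts are assumptions rather than anything from the original answer:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.default_rng(0)

# Toy data: counts 2,000 / 200 / 50 stand in for the 20k-vs-50 imbalance.
counts = {0: 2_000, 1: 200, 2: 50}
y = np.concatenate([np.full(n, c) for c, n in counts.items()])
X = rng.normal(loc=y[:, None], scale=2.0, size=(y.size, 5))

# Oversample the rare classes only up to 500 rows each ...
X_over, y_over = SMOTE(
    sampling_strategy={1: 500, 2: 500}, random_state=0
).fit_resample(X, y)

# ... and undersample the big class down to 500, instead of inflating
# everything to the majority count.
X_res, y_res = RandomUnderSampler(
    sampling_strategy={0: 500}, random_state=0
).fit_resample(X_over, y_over)

print(np.bincount(y_res))  # -> [500 500 500]
```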

Lastly, carefully selecting the loss function and evaluation metric may help. You can find more (and more detailed) information in papers such as "Survey on deep learning with class imbalance".
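
As a small, made-up illustration of why the metric choice matters on imbalanced data (not part of the original answer): plain accuracy can look excellent while macro-averaged metrics expose that the minority classes are being ignored.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# A classifier that always predicts the majority class on a 95/5 split.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))           # 0.95 - looks fine
print(balanced_accuracy_score(y_true, y_pred))  # 0.50 - chance level
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.49
```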

sunside
  • What do you mean by a scaling factor for classification? Is it something similar to class weights? – learnToCode Jul 11 '20 at 07:29
  • The client has asked to refrain from using deep learning. Also, the imbalance is too high: one class has around 20k observations and another has around 50, and this is the case with almost all the classes in the target variable. – learnToCode Jul 22 '20 at 05:31
  • @learnToCode My bad, I didn't think of just using statistical methods. A comment already suggested bootstrapping and ensemble methods; there's some answer [over here](https://stats.stackexchange.com/questions/345932/bootstrapping-dataset-with-imbalanced-classes). Did you look at the [`imblearn`](https://github.com/scikit-learn-contrib/imbalanced-learn/) package already? Generative techniques such as SMOTE might work for you. – sunside Jul 22 '20 at 18:49
  • Yup. I had tried using SMOTE and its variations, but the problem is that I have 543 classes and the class with the maximum frequency has 20k observations. After oversampling, all the classes would have 20k samples, i.e. 543 × 20k samples in total. This increases the data size tremendously and, to be honest, when I tried it the environment crashed (I was working on the Azure cloud). – learnToCode Jul 23 '20 at 08:12
  • I see. Did you try undersampling the overrepresented classes as well, in order to reduce the total number of samples? In any case, an ensemble method composed of weak learners may just be the way to go then, as you won't necessarily need all the samples at every point in time to construct them (a sketch of such a balanced bagging ensemble follows at the end of this thread). As long as your model turns out to be somewhat better than what you had before (for some training examples), you can treat it as a weak learner itself to form a more powerful ensemble. – sunside Jul 23 '20 at 11:12
  • I was hesitant to undersample, as the maximum frequency is 20k and the minimum is 50, so undersampling would bring every class down to about 50 observations and lose too much data (20k down to 50). I hope you get my concern. So I investigated the data and checked whether the imbalance is persistent; I found that a similar imbalance will be present in the test data, so I thought of building the model on the imbalanced data, as the client is more focused on accuracy than on any other metric. – learnToCode Jul 24 '20 at 06:45
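
As a hedged follow-up to the weak-learner suggestion above (not something posted in the thread): imblearn ships a BalancedBaggingClassifier that fits each base estimator on a bootstrap sample in which the majority classes are randomly undersampled, so no single fit has to hold a fully rebalanced dataset in memory; the toy data below is hypothetical.

```python
import numpy as np
from imblearn.ensemble import BalancedBaggingClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data: 20,000 rows of one class, 50 of another.
y = np.concatenate([np.zeros(20_000, dtype=int), np.ones(50, dtype=int)])
X = rng.normal(loc=y[:, None], scale=3.0, size=(y.size, 4))

# Each of the 25 base learners (decision trees by default) is fit on a
# bootstrap sample in which the majority class is undersampled to match
# the minority class, so no single fit ever sees the full 20k rows.
clf = BalancedBaggingClassifier(n_estimators=25, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))
```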