
I am working with large-scale, imbalanced datasets from which I need to pick a stratified training set. Even though the dataset is strongly imbalanced, I still need to ensure that every label class is included at least once in the training set. sklearn's train_test_split and StratifiedShuffleSplit do not "guarantee" this inclusion.

Here is an example:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape((50, 2))
y = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,4,4]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=4, random_state=42, stratify=y)

print(X_train, y_train)

The result is

[[80 81]
 [48 49]
 [18 19]
 [30 31]] [2, 2, 1, 1]

So label classes 3 and 4 are not included in this training split. Given the absolute train_size=4, these two classes are simply not large enough to be included. For a strictly stratified split this is correct. However, for the smaller classes I at least need to make sure that the algorithm "has seen the label class". I therefore need some kind of softening of the stratification principle, with some kind of proportional inclusion of the smaller classes. I have written quite a bit of code to achieve this: it removes the smaller classes first and then handles them separately with a proportional split. However, removing them also influences the subsequent train_test_split, because the class counts and the total size change.
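To make the desired behavior concrete, here is a minimal sketch (the helper name stratified_split_with_all_classes is my own invention, not an existing sklearn function): it forces one randomly chosen sample of every class into the training set and then stratifies the remainder where the class counts allow.

import numpy as np
from sklearn.model_selection import train_test_split

def stratified_split_with_all_classes(X, y, train_size, random_state=None):
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X), np.asarray(y)

    # One randomly chosen representative per class goes into the training set.
    forced = np.array([rng.choice(np.flatnonzero(y == c)) for c in np.unique(y)])
    remaining = np.setdiff1d(np.arange(len(y)), forced)

    n_extra = train_size - len(forced)
    if n_extra <= 0:
        train_idx, test_idx = forced, remaining
    else:
        try:
            # Fill up the training set with a stratified split of the rest.
            extra, rest = train_test_split(
                remaining, train_size=n_extra,
                stratify=y[remaining], random_state=random_state)
        except ValueError:
            # Some remaining class is too small to stratify; fall back to a
            # plain random split for the filler samples.
            extra, rest = train_test_split(
                remaining, train_size=n_extra, random_state=random_state)
        train_idx, test_idx = np.concatenate([forced, extra]), rest

    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

With the example above and train_size=4, every one of the four classes contributes exactly one sample to the training set.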

Is there any simple function/algorithm to achieve this behavior?

Andreas

1 Answer

Have you checked sklearn.model_selection.StratifiedKFold? Try setting n_splits to be less than or equal to the number of members in the least populated class. If you have already tried that, then I can only recommend the under-/over-sampling methods from imbalanced-learn.
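A rough sketch with the data from the question, setting n_splits to the size of the smallest class (2 here), so that every class shows up on both sides of each fold:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(100).reshape((50, 2))
y = np.array([1]*23 + [2]*20 + [3]*5 + [4]*2)

# n_splits may not exceed the number of members of the least populated class.
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
train_idx, test_idx = next(skf.split(X, y))
print(np.unique(y[train_idx]))  # [1 2 3 4] -> every class is represented

Note that with n_splits=2 the training side of each fold holds roughly half of the data, so this only helps if such a large training set is acceptable.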

Sanjar Adilov
  • Thank you very much for the suggestions. Unfortunately, StratifiedKFold with n_splits=2 gives a 50:50 split in that case, which does not resolve the problem. I looked at imbalanced-learn and found that a RandomUnderSampler with a custom sampling_strategy using an exponential decrease would solve this problem. However, such a custom implementation is exactly what I have today, with an exponential decrease across the classes (I have used sklearn's resample for this). So no real simplification here... – Andreas Jan 09 '21 at 16:01
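For reference, a rough sketch of the kind of custom sampling_strategy described in this comment (the starting count of 8 and the halving from class to class are made-up illustration values):

import numpy as np
from imblearn.under_sampling import RandomUnderSampler

X = np.arange(100).reshape((50, 2))
y = np.array([1]*23 + [2]*20 + [3]*5 + [4]*2)

classes, counts = np.unique(y, return_counts=True)
count_of = dict(zip(classes, counts))

# Target counts decrease exponentially across the classes (largest class
# first), capped at the available count and never dropping below one sample.
targets = {}
n = 8
for c in classes[np.argsort(-counts)]:
    targets[c] = int(min(max(n, 1), count_of[c]))
    n //= 2

sampler = RandomUnderSampler(sampling_strategy=targets, random_state=42)
X_train, y_train = sampler.fit_resample(X, y)
print(np.unique(y_train, return_counts=True))
# (array([1, 2, 3, 4]), array([8, 4, 2, 1])) with these illustration values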