I am working with large-scale, imbalanced datasets from which I need to pick a stratified training set. Even though the dataset is strongly imbalanced, I still need to ensure that every label class is included at least once in the training set. sklearn's train_test_split and StratifiedShuffleSplit do not guarantee this inclusion.
Here is an example:
import numpy as np
from sklearn.model_selection import train_test_split
X = np.arange(100).reshape((50, 2))
y = [1]*23 + [2]*20 + [3]*5 + [4]*2  # class counts: 23, 20, 5, 2
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=4, random_state=42, stratify=y)
print(X_train, y_train)
The result is
[[80 81]
 [48 49]
 [18 19]
 [30 31]] [2, 2, 1, 1]
So label classes 3 and 4 are not included in this training split. Given the absolute train_size=4, these two classes are simply too small to get a slot (4 * 5/50 = 0.4 and 4 * 2/50 = 0.16 expected samples, which round to zero). For a strictly stratified split, this is correct behavior. However, for the smaller classes I need to at least make sure that the algorithm "has seen the label class". So I need some kind of softening of the stratification principle, with a proportional inclusion of the smaller classes. I have written quite a bit of code to achieve this: it removes the smaller classes first and then handles them separately with a proportional split. However, removing them also influences the train_test_split of the rest, because the class counts and the total size change.
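To illustrate, here is a simplified sketch of the behavior I am after (the helper name train_test_split_min_one is made up, it is not part of sklearn): reserve one random index per class for the training set, then fill the remaining slots with an ordinary stratified split over the leftover samples, falling back to a plain random split when the leftover class counts are too small to stratify.

import numpy as np
from sklearn.model_selection import train_test_split

def train_test_split_min_one(X, y, train_size, random_state=None):
    """Stratified-ish split that puts every class into the training set
    at least once. Made-up helper, not part of sklearn."""
    rng = np.random.default_rng(random_state)
    y = np.asarray(y)
    classes = np.unique(y)
    if train_size < len(classes):
        raise ValueError("train_size must be >= the number of classes")

    # Reserve one random index per class for the training set.
    reserved = np.array([rng.choice(np.flatnonzero(y == c)) for c in classes])
    remaining = np.setdiff1d(np.arange(len(y)), reserved)

    n_fill = train_size - len(classes)
    if n_fill > 0:
        # Fill the remaining training slots with a stratified split;
        # fall back to a plain random split when the leftover classes
        # are too small for stratification (sklearn requires at least
        # 2 members per class and n_classes samples on each side).
        try:
            fill, rest = train_test_split(
                remaining, train_size=n_fill,
                stratify=y[remaining], random_state=random_state)
        except ValueError:
            fill, rest = train_test_split(
                remaining, train_size=n_fill, random_state=random_state)
        train_idx = np.concatenate([reserved, fill])
        test_idx = rest
    else:
        train_idx, test_idx = reserved, remaining

    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

With the example data from above, train_test_split_min_one(X, y, train_size=4, random_state=42) puts exactly one sample of each class into the training set, since train_size equals the number of classes:

X_train, X_test, y_train, y_test = train_test_split_min_one(
    X, y, train_size=4, random_state=42)
print(sorted(y_train))  # [1, 2, 3, 4]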
Is there any simple function/algorithm to achieve this behavior?