
I'm using a custom TensorFlow model for an imbalanced classification problem. For this I need to split the data into a train and a test set, and split the train set into batches. However, the batches need to be stratified because of the imbalance. For now I'm doing it like this:

X_train, X_test, y_train, y_test = skmodel.train_test_split(
    Xscaled, y_new, test_size=0.2, stratify=y_new)
dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train)).shuffle(
    X_train.shape[0]).batch(batch_size)

But I am not sure whether the batches in dataset are stratified. If they are not, how can I make sure that they are?
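One quick way to see what shuffle + batch actually does is to inspect the positive-class fraction per batch. This is a sketch with toy data; the arrays here merely stand in for the real X_train / y_train, and the ~10% positive ratio is an assumption for illustration:

```python
import numpy as np
import tensorflow as tf

# Toy imbalanced data standing in for the real X_train / y_train.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4)).astype("float32")
y_train = (rng.random(1000) < 0.1).astype("int32")  # roughly 10% positives

batch_size = 64
dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train)).shuffle(
    X_train.shape[0]).batch(batch_size)

# With plain shuffle + batch, each batch's positive fraction only
# fluctuates around the global ratio; it is not held fixed per batch.
for i, (_, yb) in enumerate(dataset.take(5)):
    frac = float(tf.reduce_mean(tf.cast(yb, tf.float32)))
    print(f"batch {i}: positive fraction = {frac:.3f}")
```

Running this shows the per-batch fraction varying from batch to batch, which is exactly the "not stratified" behavior asked about.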

RasM10
  • The batches are definitely not stratified, but you don't really have a simple way to obtain that with tf.data (and besides, depending on the batch size and the positive/negative class ratio, it may be impossible to have exactly stratified classes). The normal approach is to just train with the dataset as-is and maybe consider a loss that works better for heavily-unbalanced datasets (e.g., have a look at the focal loss) – GPhilo Oct 29 '21 at 09:55
  • Thanks! I am using a custom loss function for unbalanced data! Do you know whether, if I only use from_tensor_slices to create the dataset, the data will stay stratified? – RasM10 Oct 29 '21 at 12:48
  • The only way I can think of having stratified batches is to have a positive-class dataset, a negative-class dataset and generating batches by taking an appropriate number of samples from each dataset and "manually" making a batch (and even this suffers from the fact that eventually one of the datasets will run out of samples before the other, depending on the exact ratio of positive/negative and the batch size). The short story is: don't bother with stratified batches, just make sure you shuffle your dataset and train long enough. – GPhilo Oct 29 '21 at 13:05
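The manual approach described in that comment can be sketched with tf.data.Dataset.zip plus tf.concat: batch each class dataset separately so every combined batch holds a fixed number of positives. This is a sketch on toy data (the arrays and the 10% ratio are assumptions, not from the question); repeat() sidesteps one class running out before the other, at the cost of having to define an epoch via steps rather than dataset exhaustion:

```python
import numpy as np
import tensorflow as tf

# Hypothetical imbalanced data; names are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4)).astype("float32")
y = (rng.random(1000) < 0.1).astype("int32")  # ~10% positive class

batch_size = 64
n_pos = max(1, round(batch_size * 0.1))  # positives per batch, e.g. 6 of 64

# One dataset per class; repeat() so the smaller class never runs dry.
pos = tf.data.Dataset.from_tensor_slices((X[y == 1], y[y == 1])).shuffle(200).repeat()
neg = tf.data.Dataset.from_tensor_slices((X[y == 0], y[y == 0])).shuffle(1000).repeat()

# Batch each class at its quota, then concatenate into one stratified batch.
# Within-batch order is positives-then-negatives, which doesn't affect the
# gradient of a per-example loss averaged over the batch.
mixed = tf.data.Dataset.zip((pos.batch(n_pos), neg.batch(batch_size - n_pos))).map(
    lambda p, n: (tf.concat([p[0], n[0]], axis=0),
                  tf.concat([p[1], n[1]], axis=0)))

for _, yb in mixed.take(3):
    print("positives in batch:", int(tf.reduce_sum(tf.cast(yb, tf.int32))))
```

For an approximate (probabilistic rather than exact) version, tf.data also offers sample_from_datasets with class weights, which mixes the two datasets at a target ratio without guaranteeing a fixed count per batch.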

0 Answers