
Let's say we have 2 classes: one is small and the second is large.

I would like to use something similar to ImageDataGenerator for data augmentation on the small class, and to sample from the large class, so that each batch is balanced (for the minor class: augmentation; for the major class: sampling).

Also, I would like to continue using image_dataset_from_directory (since the dataset doesn't fit into RAM).

Michael D

4 Answers


You can use tf.data.Dataset.from_generator, which allows more control over your data generation without loading all of your data into RAM.

def generator():
    # Alternate between a sample from the large class and an
    # augmented sample from the small class.
    i = 0
    while True:
        if i % 2 == 0:
            elem = large_class_sample()
        else:
            elem = small_class_augmented()
        yield elem
        i += 1

ds = tf.data.Dataset.from_generator(
    generator,
    output_signature=tf.TensorSpec(shape=yourElem_shape, dtype=yourElem_dtype))

This generator will alternate samples between the two classes, and you can then add more dataset operations (batch, shuffle, ...).
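
For instance, a minimal sketch of chaining such operations onto this dataset (the shuffle buffer size and batch size below are arbitrary illustrative values):

ds = ds.shuffle(buffer_size=1000)      # buffer size chosen for illustration only
ds = ds.batch(32)                      # batch size chosen for illustration only
ds = ds.prefetch(tf.data.AUTOTUNE)     # overlap preprocessing and training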

Tou You

What about the sample_from_datasets function?

import tensorflow as tf
from tensorflow.python.data.experimental import sample_from_datasets

def augment(val):
    # Example of augmentation function
    return val - tf.random.uniform(shape=tf.shape(val), maxval=0.1)

big_dataset_size = 1000
small_dataset_size = 10

# Init some datasets
dataset_class_large_positive = tf.data.Dataset.from_tensor_slices(tf.range(100, 100 + big_dataset_size, dtype=tf.float32))
dataset_class_small_negative = tf.data.Dataset.from_tensor_slices(-tf.range(1, 1 + small_dataset_size, dtype=tf.float32))

# Upsample and augment small dataset
dataset_class_small_negative = dataset_class_small_negative \
    .repeat(big_dataset_size // small_dataset_size) \
    .map(augment)

dataset = sample_from_datasets(
    datasets=[dataset_class_large_positive, dataset_class_small_negative], 
    weights=[0.5, 0.5]
)

dataset = dataset.shuffle(100)
dataset = dataset.batch(6)

iterator = dataset.as_numpy_iterator()
for i in range(5):
    print(next(iterator))

# [109.        -10.044552  136.        140.         -1.0505208  -5.0829906]
# [122.        108.        141.         -4.0211563 126.        116.       ]
# [ -4.085523  111.         -7.0003924  -7.027302   -8.0362625  -4.0226436]
# [ -9.039093  118.         -1.0695585 110.        128.         -5.0553837]
# [100.        -2.004463  -9.032592  -8.041705 127.       149.      ]

Set up the desired balance between the classes in the weights parameter of sample_from_datasets.
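
For example, a hypothetical 30/70 split instead of the balanced one would be:

dataset = sample_from_datasets(
    datasets=[dataset_class_large_positive, dataset_class_small_negative],
    weights=[0.3, 0.7])  # 30% from the large class, 70% from the small class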

As Yaoshiang noticed, the last batches are imbalanced and the dataset lengths are different. This can be avoided by

# Repeat infinitely both datasets and augment the small one
dataset_class_large_positive = dataset_class_large_positive.repeat()
dataset_class_small_negative = dataset_class_small_negative.repeat().map(augment)

instead of

# Upsample and augment small dataset
dataset_class_small_negative = dataset_class_small_negative \
    .repeat(big_dataset_size // small_dataset_size) \
    .map(augment)

In this case, however, the dataset is infinite and the number of batches per epoch has to be controlled separately.
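
As a minimal sketch of one way to do that, assuming a compiled Keras model named `model` (not part of this answer), you can cap the number of batches drawn per epoch with steps_per_epoch:

# `model` is a hypothetical compiled Keras model
steps_per_epoch = 2 * big_dataset_size // 6  # batch size of 6 as in the example above
model.fit(dataset, epochs=10, steps_per_epoch=steps_per_epoch)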

Alexey Tochin
  • How can data augmentation be applied to "small_negative_class"? – Michael D Dec 19 '21 at 18:51
  • @MichaelD as usual, for example, `dataset_class_small_negative = dataset_class_small_negative.map(augmentation_func)` – Alexey Tochin Dec 20 '21 at 09:48
  • Is `.repeat()` necessary here? May it introduce unwanted **oversampling** of the *small class*? – Michael D Dec 21 '21 at 08:59
  • As far as I understand, you need to train on balanced batches built from unbalanced classes. The only way to do that is to oversample the small dataset. Thus, `repeat` is naturally needed in order to produce the same amount of samples. Probably you just have to augment the dataset after the `repeat`. – Alexey Tochin Dec 21 '21 at 10:00
  • Agree, I would say that the `repeat` is only needed for the minor class, and `repeat` should use the ratio that balances the 2 classes. I will slightly modify your answer. Let me know what you think. – Michael D Dec 21 '21 at 10:45
  • You can either apply `repeat(int(dataset_size_ratio))` to only the small dataset, or apply `repeat()` to both datasets and sample batches a finite number of times. The second option may be preferable because all batches are of the same size and the epoch length can be modified. – Alexey Tochin Dec 21 '21 at 12:16
  • I've added `ratio = int(dataset_class_large_positive.cardinality().numpy() / dataset_class_small_negative.cardinality().numpy())` and `dataset = sample_from_datasets(datasets=[dataset_class_large_positive, dataset_class_small_negative.repeat(ratio).map(augment)], weights=[0.5, 0.5])`. But it's waiting for approval from trusted Stack Overflow community members. – Michael D Dec 21 '21 at 12:46
  • Suggested my version of your changes. Dataset `cardinality` requires processing the entire dataset before starting, which is not desirable. So we have to know the dataset sizes in advance if you do not want to use an infinite `repeat`. – Alexey Tochin Dec 21 '21 at 14:18
  • Good to know regarding `cardinality`. Agree. – Michael D Dec 21 '21 at 22:19
  • I haven't looked at the implementation but I would avoid this for a few reasons. First, the stochastic approach can lead to different dataset lengths on different epochs if stop_on_empty_dataset is true. Secondly, if attempting to blend say 2 datasets of cardinality 1000, stochastic means the final batches will be dominated by one of the two datasets, leading to a distribution shift. – Yaoshiang Dec 23 '21 at 22:36
  • @Yaoshiang, both of the problems are solved by simply applying `.repeat()` to both of the datasets instead of `.repeat(int(dataset_size_ratio))` to only the small dataset. In this case, you have to manually restrict the number of sampled batches during your epoch. This is discussed a few messages above. The author of this question preferred the finite repeat approach, which motivates the current version of the answer. – Alexey Tochin Dec 24 '21 at 15:01
  • Thanks @AlexeyTochin, you are correct - for rebalanced datasets, adding repeat would solve the hypothetical issue I brought up and the goal of the problem. Thanks for the clarification. – Yaoshiang Dec 27 '21 at 21:48

I didn't totally follow the problem. Would this pseudo-code work? Perhaps there are some operators on tf.data.Dataset that are sufficient to solve your problem.

ds = image_dataset_from_directory(...)

ds1 = ds.filter(lambda image, label: label == MAJORITY)
ds2 = ds.filter(lambda image, label: label != MAJORITY)

# Augment only the minority class, keeping the label
ds2 = ds2.map(lambda image, label: (data_augment(image), label))

# Batch each class stream separately
ds1 = ds1.batch(int(10. / MAJORITY_RATIO))
ds2 = ds2.batch(int(10. / MINORITY_RATIO))

# Pair one majority batch with one minority batch and concatenate them
ds3 = tf.data.Dataset.zip((ds1, ds2))
ds3 = ds3.map(lambda left, right: (tf.concat([left[0], right[0]], axis=0),
                                   tf.concat([left[1], right[1]], axis=0)))
Yaoshiang
  • I will try to convert it to code and test then update. – Michael D Dec 19 '21 at 13:05
  • Can you please clarify the purpose of `int(10. / MAJORITY_RATIO)`? I've tried to make a simple example and it didn't work. Something is missing, maybe the resampling for the *large class*. Also, each batch doesn't seem to be balanced. Can you add some example with *range(100)* and *-range(10)* as inputs? – Michael D Dec 21 '21 at 09:36

You can use tf.data.Dataset.from_tensor_slices to load the images of the two categories separately and do data augmentation for the minority class. Once you have the two datasets, combine them with tf.data.Dataset.sample_from_datasets.

import os
from glob import glob
import tensorflow as tf

# assume class1 is the minority class
files_class1 = glob('class1\\*.jpg')
files_class2 = glob('class2\\*.jpg')

def augment(filepath):
    class_name = tf.strings.split(filepath, os.sep)[0]
    image = tf.io.read_file(filepath)
    image = tf.io.decode_jpeg(image, channels=3)  # decode the raw bytes into an image tensor
    image = tf.expand_dims(image, 0)
    image_flip = image
    if tf.equal(class_name, 'class1'):
        # do all the data augmentation here
        image_flip = tf.image.flip_left_right(image)
    # return both the original and the augmented image, each with its class name
    return [[image, class_name], [image_flip, class_name]]

# apply data augmentation for class1
train_class1 = tf.data.Dataset.from_tensor_slices(files_class1).map(
    augment, num_parallel_calls=tf.data.AUTOTUNE)
train_class2 = tf.data.Dataset.from_tensor_slices(files_class2)

dataset = tf.data.Dataset.sample_from_datasets(
    datasets=[train_class1, train_class2],
    weights=[0.5, 0.5])

dataset = dataset.batch(BATCH_SIZE)
ram nithin