Data Augmentation: What proportion of training dataset needs to be augmented?

Question

I am currently working on a speech classification problem. I have 1000 audio files in each class and have 7 such classes. I need to augment data to achieve better accuracy. I am using librosa library for data augmentation. For every audio file, I am using the below code.

fbank_train = []
labels_train = []
for wav in x_train_one[:len(x_train_one)]:
    samples, sample_rate = librosa.load(wav, sr=16000)
    if (len(samples)) == 16000:
        label = wav.split('/')[6]
        fbank = logfbank(samples, sample_rate, nfilt=16)
        fbank_train.append(fbank)
        labels_train.append(label)
        y_shifted = librosa.effects.pitch_shift(samples, sample_rate, n_steps=4, bins_per_octave=24)
        fbank_y_shifted = logfbank(y_shifted, sample_rate, nfilt=16)
        fbank_train.append(fbank_y_shifted)
        labels_train.append(label)
        change_speed = librosa.effects.time_stretch(samples, rate=0.75)
        if(len(change_speed)>=16000):
            change_speed = change_speed[:16000]
            fbank_change_speed = logfbank(change_speed, sample_rate, nfilt=16)
            fbank_train.append(fbank_change_speed)
            labels_train.append(label)
        change_speedp = librosa.effects.time_stretch(samples, rate=1.25)
        if(len(change_speedp)<=16000):
            change_speedp = np.pad(change_speedp, (0, max(0, 16000 - len(change_speedp))), "constant")
            fbank_change_speedp = logfbank(change_speedp, sample_rate, nfilt=16)
            fbank_train.append(fbank_change_speedp)
            labels_train.append(label)

That is I am augmentating each audio file (pitch-shifting and time-shifting). I would like to know, is this the correct way of augmentation of training dataset? And if not, what is the proportion of audio files that need to be augmented?

score 2 · Accepted Answer · answered Dec 06 '19 at 11:31

The most common way of performing augmentation is doing it to the whole dataset with a random chance for each sample to be augmented or not.

Also in most cases, the augmentation is done during runtime.

For example a pseudocode for your case could look like:

for e in epochs:
    reshuffle_training_set
    for x, y in training_set:
        if np.random.random() > 0.5:
            x = randomly_shift_pitch(x)
        if np.random.random() > 0.5:
            x = randomly_shift_time(x)
        model.fit(x, y)

This means that each image has a 25% chance of not being augmented at all, a 25% chance of being only time-shifted, a 25% chance of being only pitch-shifted and a 25% chance of being both time and pitch-shifted.

During the next epoch, that same image is augmented again with the above strategies. If you train your model through multiple epochs, each image will pass through each combination of augmentations (with a high probability), so the model will learn from them all.

Also if each of the shifts is done randomly, even if a sample passed through the same augmentor twice, it wouldn't result in the same perturbed sample.

A benefit of augmenting the images during runtime and not performing the full augmentation beforehand is that if you wanted the same result, you'd need to create multiple new datasets (i.e. a few time-shifted ones, pitch-shifted ones and combinations of both) and train the model on the combined large dataset.

There's not an exact answer to that question, because you have a probability of augmenting each sample. In my pseudocode, for each real sample your model gets trained on, there are 3 generated samples (so the easy answer is the augmented dataset is equivalent to 4 times the size of the original). However, each augmentor works randomly, so if you pass the same sample through the same augmentor multiple times, a different sample comes out each time. In practice the combinations you can have are a lot more. — Djib2011, Dec 06 '19 at 13:51

Data Augmentation: What proportion of training dataset needs to be augmented?

1 Answers1