
I am currently building a 1D-CNN for classification. The predictors are spectra (X-matrix with 779 features), and the dependent variable contains two classes.

However, the X-matrix contains repeated measurements (series of 15-20 replicates). It is crucial that, during training, repeated measurements do not end up in both the set used for training and the set used for evaluating the loss. Is there a way to build "custom" mini-batches that would avoid this?


1 Answer


You should try using data generators.

A DataGenerator is an object that takes the X_train and y_train matrices as input and puts the samples into batches following some criterion. It can also be used to handle large volumes of data that cannot be loaded into memory all at once.

Here is an example of how to implement one!

Basically, __getitem__ will give you your next batch, so that's the place to implement any conditions you might need.

import numpy as np
import keras

class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, X, labels, batch_size=32, dim=(32,32,32), n_channels=1,
                 n_classes=10, shuffle=True):
        'Initialization'
        self.dim = dim
        self.batch_size = batch_size
        self.labels = labels
        self.X = X
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.X) / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate the indexes of the batch; this is the place to add your
        # own criterion so that samples don't repeat where they shouldn't.
        # The default below simply takes the next slice of the shuffled indexes.
        list_IDs_temp = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Generate data
        X, y = self.__data_generation(list_IDs_temp)

        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.X))
        if self.shuffle:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temp):
        'Generates data containing batch_size samples'  # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, *self.dim, self.n_channels))
        y = np.empty(self.batch_size, dtype=int)

        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            X[i] = self.X[ID]

            # Store class
            y[i] = self.labels[ID]

        return X, keras.utils.to_categorical(y, num_classes=self.n_classes)

Source: This
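
For completeness, here is a minimal sketch of how such a generator plugs into training. The input shape (779 features, 1 channel) comes from the question; the dummy data and the tiny Conv1D model are placeholders of my own, not part of the original answer.

import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Dense

# Dummy stand-in data: 1000 spectra with 779 features, 1 channel, 2 classes
X_train = np.random.rand(1000, 779, 1)
y_train = np.random.randint(0, 2, size=1000)

training_generator = DataGenerator(X_train, y_train, batch_size=32,
                                   dim=(779,), n_channels=1, n_classes=2)

# Placeholder 1D-CNN; any model with a matching input shape works
model = Sequential([
    Conv1D(16, kernel_size=7, activation='relu', input_shape=(779, 1)),
    GlobalMaxPooling1D(),
    Dense(2, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

# The Sequence is consumed batch by batch; newer Keras versions also accept
# the generator directly in model.fit(...)
model.fit_generator(training_generator, epochs=5)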

  • Thank you. Seems very reasonable. This implementation can also be used for leave-one-object-out cross-validation, where the object is one series of repeated measurements. Will update after implementing this into my code. – Petar Oct 21 '18 at 07:57
  • Yeah I guess you can adapt this class to do some sort of cross validation. However it is mainly intended to cherry pick batches. Let me know if you encounter any problems ! – Gabriel M Oct 22 '18 at 08:05
  • Let me know if I got this correctly: __getitem__ creates a new batch of indices, which are passed to __data_generation to create the training X and y. Are the rest of the samples for testing? I added an additional input to the class called siteLabel, which contains all the labels of the training set. Repeated measurements have identical labels and represent one site, but the size of one site is not always the same. My solution is to define batch_size as the number of sites and generate random site indices, then unfold them into indices of repeated measurements and feed those into __data_generation (see the sketch after these comments). – Petar Oct 22 '18 at 15:53
  • It is kind of the opposite: __getitem__ is the function called by Keras to get the next batch. The __data_generation function is called only by __getitem__, and you may not even need to define it; they use it to make everything easier to read. In their case __getitem__ generates the ids of the samples that will fill the batch, and __data_generation takes those ids and fills X and y. I'll edit my answer to make this clearer. – Gabriel M Oct 23 '18 at 07:37
  • The rest of the samples are not for testing. Say you have 300 training samples in batches of 30; that makes 10 batches, so each time this function is called it returns 1 batch of 30 samples. It is up to you to decide which samples you pass. You could always pass the same 30 samples, but that would be a mistake, because then you would not be using all the data you provide for training. – Gabriel M Oct 23 '18 at 07:41
  • Otherwise, the solution you describe seems perfect to me. Remember that this is not the function that makes the train/test splits; that is something you handle before the training. The X matrix that you provide to this object is your whole training data, and the function puts it into batches following some criterion. – Gabriel M Oct 23 '18 at 07:45
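
To make the site-based idea from the comments concrete, here is a minimal sketch of a subclass that overrides __getitem__ with that criterion. The siteLabel array and the name SiteBatchGenerator are hypothetical, taken from the discussion above. It assumes each batch is built from whole sites (series of replicates), so one series never straddles two batches, and the batch size varies with site size; dim and n_channels are unused because X is indexed directly.

import numpy as np
import keras

class SiteBatchGenerator(DataGenerator):
    'Batches whole sites (series of replicates) together'
    def __init__(self, X, labels, site_labels, sites_per_batch=4, **kwargs):
        self.site_labels = np.asarray(site_labels)  # one site id per row of X
        self.sites = np.unique(self.site_labels)
        self.sites_per_batch = sites_per_batch
        super().__init__(X, labels, **kwargs)

    def __len__(self):
        # One batch per group of sites_per_batch sites
        return int(np.floor(len(self.sites) / self.sites_per_batch))

    def on_epoch_end(self):
        # Shuffle whole sites, not individual replicates
        self.site_order = self.sites.copy()
        if self.shuffle:
            np.random.shuffle(self.site_order)

    def __getitem__(self, index):
        'One batch = all replicates of sites_per_batch randomly ordered sites'
        chosen = self.site_order[index * self.sites_per_batch:
                                 (index + 1) * self.sites_per_batch]
        # Unfold the chosen site ids into the row indices of their replicates
        ids = np.flatnonzero(np.isin(self.site_labels, chosen))
        X = np.asarray(self.X)[ids]
        y = np.asarray(self.labels)[ids]
        return X, keras.utils.to_categorical(y, num_classes=self.n_classes)

As Gabriel notes above, this only controls batching: the train/test split itself is still handled beforehand, for example by splitting on site labels rather than on individual rows so that a series never appears on both sides.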