Keras custom data generator for large hdf5 file which does not fit into memory

Question

I'm trying to use the pretrained InceptionV3 model to classify the food-101 dataset, which contains food images for 101 categories, 1000 per category. So far I've preprocessed this dataset into a single hdf5 file (I assumed this is beneficial compared to loading images on the go when training), which has the following tables inside:

The data split is the standard 70% train, 20% validation, 10% test, so for example valid_img has a shape of 20200*299*299*3. The labels are one-hot encoded for Keras, so valid_labels has a shape of 20200*101.

This hdf5 file has a size of 27.1 GB, so it will not fit into my memory. (I have 8 GB of RAM, although probably only 4-5 GB is usable while running Ubuntu. My GPU is a GTX 960 with 2 GB of VRAM, and so far it looks like 1.5 GB is available to Python when I start the training script.) I'm using the Tensorflow backend.
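
As a rough check on these numbers, the sizes can be worked out directly. The sketch below assumes the images are stored as 8-bit unsigned integers (one byte per channel value), which the question does not state; under that assumption the image tables alone add up to roughly the reported 27.1 GB. The float32 figure for a 706-image chunk is also an assumption about how the data might be cast before training.

```python
# Back-of-the-envelope memory arithmetic for the food-101 HDF5 file.
# Assumption: images are stored as uint8 (1 byte per channel value).
BYTES_PER_IMAGE = 299 * 299 * 3          # one 299x299 RGB image as uint8
TOTAL_IMAGES = 101 * 1000                # 101 categories, 1000 images each


def image_bytes(n_images, bytes_per_value=1):
    """Bytes needed to hold n_images images at the given value width."""
    return n_images * BYTES_PER_IMAGE * bytes_per_value


total_bytes = image_bytes(TOTAL_IMAGES)
valid_bytes = image_bytes(20200)
# Assumption: a 706-image training chunk cast to float32 (4 bytes per value).
chunk_bytes_float32 = image_bytes(706, bytes_per_value=4)

print(f"all images:  {total_bytes / 1e9:.2f} GB")
print(f"valid_img:   {valid_bytes / 1e9:.2f} GB")
print(f"706 float32: {chunk_bytes_float32 / 1e9:.2f} GB")
```

So a single chunk of a few hundred images fits comfortably in 4-5 GB of RAM, while even the validation table on its own does not.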

The first idea I had is to use model.train_on_batch() with a double nested for loop like this:

#Loading InceptionV3, adding my fully connected layers, compiling model...

import h5py

dataset = h5py.File('/home/uzoltan/PycharmProjects/food-101/food-101_299x299.hdf5', 'r')
epoch = 50
for i in range(epoch):
    for j in range(100): #1000 images can fit in the memory easily, this could probably be range(10) too
        # Slicing the h5py dataset reads only this chunk from disk.
        train_images = dataset["train_img"][j * 706:(j + 1) * 706, ...]
        train_labels = dataset["train_labels"][j * 706:(j + 1) * 706, ...]
        # The validation chunks are loaded, but train_on_batch cannot use them.
        val_images = dataset["valid_img"][j * 202:(j + 1) * 202, ...]
        val_labels = dataset["valid_labels"][j * 202:(j + 1) * 202, ...]
        model.train_on_batch(x=train_images, y=train_labels, class_weight=None,
                             sample_weight=None)

My problem with this approach is that train_on_batch provides no support for validation or for batch shuffling, so the batches arrive in the same order every epoch.

So I looked towards model.fit_generator(), which has the nice property of providing all the same functionality as fit(). In addition, with the built-in ImageDataGenerator you can do image augmentations (rotations, horizontal flips, etc.) on the CPU at the same time, so that your model can be more robust. My problem here is that, if I understand it correctly, the ImageDataGenerator.flow(x,y) method needs all the samples and labels at once, but my training/validation data won't fit into my RAM.

Here is where I think custom data generators come into the picture, but after looking extensively at the examples I could find on the Keras GitHub/Issues page, I still don't really get how I should implement a custom generator that reads batches of data from my hdf5 file. Can someone provide me with a good example or pointers? How could I couple the custom batch generator with the image augmentations? Or is it maybe easier to implement some kind of manual validation and batch shuffling for train_on_batch()? If so, I could use some pointers there too.
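
A minimal sketch of such a generator is shown here, written so that it only needs `len()` and slicing from its inputs (which h5py datasets provide). The function name `hdf5_batch_generator` and the `augment` hook are hypothetical; `augment` stands for any callable that takes a batch and returns a transformed batch.

```python
import random


def hdf5_batch_generator(images, labels, batch_size, augment=None, seed=None):
    """Yield (x, y) batches forever, visiting batches in a new order each pass.

    images and labels only need len() and slicing, so h5py datasets work
    and only one batch is read from disk at a time.
    """
    rng = random.Random(seed)
    starts = list(range(0, len(images), batch_size))
    while True:  # Keras generators are expected to loop indefinitely
        rng.shuffle(starts)
        for start in starts:
            x = images[start:start + batch_size]
            y = labels[start:start + batch_size]
            if augment is not None:
                x = augment(x)  # hypothetical hook, e.g. random flips
            yield x, y
```

With a generator like this, `steps_per_epoch` would be the number of batches per pass, for example `math.ceil(len(images) / batch_size)`. The batches are contiguous slices visited in shuffled order, so samples are not mixed across batches.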

Why cannot you simply extract all files to separate directories and use [`flow_from_directory`](https://keras.io/preprocessing/image/) function? — Marcin Możejko, Nov 01 '17 at 17:21

score 3 · Answer 1 · answered May 22 '19 at 08:19

For anyone still looking for an answer, I made the following "crude wrapper" around ImageDataGenerator's apply_transform method.

from numpy.random import uniform, randint
from tensorflow.python.keras.preprocessing.image import ImageDataGenerator
import numpy as np

class CustomImagesGenerator:
    def __init__(self, x, zoom_range, shear_range, rescale, horizontal_flip, batch_size):
        self.x = x
        self.zoom_range = zoom_range
        self.shear_range = shear_range
        self.rescale = rescale
        self.horizontal_flip = horizontal_flip
        self.batch_size = batch_size
        self.__img_gen = ImageDataGenerator()
        self.__batch_index = 0

    def __len__(self):
        # steps_per_epoch, if unspecified, will use len(generator) as the number of steps,
        # so this must return a plain int (np.floor would return a float)
        return self.x.shape[0] // self.batch_size

    def next(self):
        return self.__next__()

    def __next__(self):
        start = self.__batch_index*self.batch_size
        stop = start + self.batch_size
        self.__batch_index += 1
        if stop > len(self.x):
            raise StopIteration
        transformed = np.array(self.x[start:stop])  # loads from hdf5
        for i in range(len(transformed)):
            zoom = uniform(self.zoom_range[0], self.zoom_range[1])
            transformations = {
                'zx': zoom,
                'zy': zoom,
                'shear': uniform(-self.shear_range, self.shear_range),
                'flip_horizontal': self.horizontal_flip and bool(randint(0,2))
            }
            transformed[i] = self.__img_gen.apply_transform(transformed[i], transformations)
        return transformed * self.rescale

It can be called like so:

import h5py
f = h5py.File("my_heavy_dataset_file.hdf5", 'r')
images = f['mydatasets/images']

my_gen = CustomImagesGenerator(
    images, 
    zoom_range=[0.8, 1], 
    shear_range=6, 
    rescale=1./255, 
    horizontal_flip=True, 
    batch_size=64
)

model.fit_generator(my_gen)

I can't really try it out anymore, to accept it, this was 1.5 years ago for me ^^ But thanks, hopefully it will help someone :) — logi0517, May 23 '19 at 09:43
@Sam Your code looks really promising! I tried it out, but it gives me the error "'CustomImagesGenerator' object has no attribute 'shape'". You can have a look at how I implemented your code here: https://colab.research.google.com/drive/1nMa33mldd5wq6Pqb06AWTU1QRNfS48sy — NeStack, Aug 28 '19 at 16:52
@NeStack I would say you'd have to add a `@property` method for `shape` [like so](https://pastebin.com/b5qe54KX) let me know if it still doesn't work — Samuel Prevost, Aug 29 '19 at 06:13
@Sam The error messages went away when I did `from tensorflow.python.keras.utils.data_utils import Sequence` and then changed your code to `class CustomImagesGenerator(Sequence):`. Is this legitimate? Afterwards the code keeps running forever with the error message "...File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/utils/data_utils.py", line 742, in _run sequence = list(range(len(self.sequence))) TypeError: 'numpy.float64' object cannot be interpreted as an integer". Any idea what is wrong? — NeStack, Aug 29 '19 at 15:58
Sorry, but I don't use Keras/TF at the moment anymore so I'm quite clueless. Have you tried what I sent you in my previous answer instead of trying other approaches and seeking my help on them ? — Samuel Prevost, Sep 05 '19 at 07:27
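
The TypeError quoted in the comments is what happens when `__len__` returns a NumPy float (as `np.floor` does) where `range()` needs an int. The sketch below shows the sequence-style interface with an integer `__len__`. The class name `BatchedDataset` is hypothetical, and it deliberately does not subclass the Keras `Sequence` so that it runs without TensorFlow; in a real project one would presumably subclass it.

```python
import math


class BatchedDataset:
    """Sequence-style access: len() is the batch count, [i] is batch i."""

    def __init__(self, x, batch_size):
        self.x = x
        self.batch_size = batch_size

    def __len__(self):
        # Must be a plain int; math.ceil keeps the final partial batch.
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.batch_size
        # Slicing an h5py dataset here would read only this batch from disk.
        return self.x[start:start + self.batch_size]
```

Because each batch is addressed by index, the caller can choose the order of batches freely, which is what makes shuffling per epoch straightforward with this interface.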

score 2 · Answer 2 · answered Jul 04 '18 at 18:23

If I understood you correctly, you want to use the data (which does not fit in the memory) from HDF5 and at the same time use data augmentation on it.

I'm in the same situation as you, and I found this code, which may be helpful with a few modifications:

https://gist.github.com/wassname/74f02bc9134897e3fe4e60784f5aaa15

Helder


Just for future readers, the code linked above will load the whole HDF5 dataset in memory. HDF5Matrix class doesn't load it but the ImageDataGenerator will load the whole thing when .flow is called. – Samuel Prevost May 22 '19 at 07:29

score 0 · Answer 3 · edited Jun 04 '18 at 20:30

This is my solution for shuffling the data every epoch with an h5 file. Here, indices is the list of train or validation indices.

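import json

import h5py
import numpy as np


def generator(h5path, indices, batchSize=128, is_train=True, aug=None):
    # The file stays open for the lifetime of the generator; batches are read lazily.
    db = h5py.File(h5path, "r")
    with open("mean.json") as f:
        mean = json.load(f)
    meanV = np.array([mean["R"], mean["G"], mean["B"]], dtype=np.float32)

    while True:
        # Reshuffle once per pass so every epoch sees a different batch composition.
        np.random.shuffle(indices)
        for i in range(0, len(indices), batchSize):
            # h5py expects index lists in increasing order, hence the sort.
            batch_indices = np.sort(indices[i:i+batchSize])

            by = db["labels"][batch_indices,:]
            # Cast to float so the mean subtraction also works for uint8 images.
            bx = db["images"][batch_indices,:,:,:].astype(np.float32)

            # Subtract the per-channel mean (broadcast over the last axis).
            bx -= meanV

            if is_train:

                #bx = random_crop(bx, (224,224))
                if aug is not None:
                    bx,by = next(aug.flow(bx,by,batchSize))

            yield (bx,by)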


h5path='all_224.hdf5'   
model.fit_generator(generator(h5path, train_indices, batchSize=batchSize, is_train=True, aug=aug),
                steps_per_epoch = 20000//batchSize,
                validation_data= generator(h5path, test_indices, is_train=False, batchSize=batchSize), 
                validation_steps = 2424//batchSize,
                epochs=args.epoch, 
                max_queue_size=100,
                callbacks=[checkpoint, early_stop])
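
The answer sorts each batch's indices before reading, which fits h5py's expectation that index lists are in increasing order. Images and labels are read with the same sorted list, so the pairs stay aligned. If the shuffled order within a batch should be preserved as well, the rows can be put back afterwards, roughly as in this sketch. The helper name `read_rows_shuffled` is hypothetical, and it assumes the indices are unique.

```python
def read_rows_shuffled(dataset, indices):
    """Read dataset rows at `indices`, returned in the order of `indices`.

    Assumption: `dataset` accepts a list of increasing, unique indices,
    as h5py datasets do; it is queried once with the sorted list.
    """
    # order[p] is the position in `indices` of the p-th smallest index.
    order = sorted(range(len(indices)), key=lambda k: indices[k])
    sorted_rows = dataset[[indices[k] for k in order]]
    rows = [None] * len(indices)
    for position, k in enumerate(order):
        rows[k] = sorted_rows[position]
    return rows
```

The result is a plain list of rows; with image data one would typically stack it back into a single array before passing it to the model.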

score -1 · Answer 4 · answered Nov 01 '17 at 17:23

You want to write a function which loads images from the HDF5 and then yields (not returns) them as a numpy array. Here is a simple example which uses OpenCV to load images directly from .png/.jpg files in a given directory:

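import os
import random

import cv2
import numpy as np

def generate_data(directory, batch_size):
    """Replaces Keras' native ImageDataGenerator."""
    # INPUT_SHAPE is assumed to be defined elsewhere as a (width, height) tuple.
    i = 0
    file_list = os.listdir(directory)
    while True:
        image_batch = []
        for b in range(batch_size):
            if i == len(file_list):
                # Start a new pass over the files in a fresh random order.
                i = 0
                random.shuffle(file_list)
            sample = file_list[i]
            i += 1
            # os.listdir returns bare file names, so join them with the directory.
            image = cv2.resize(cv2.imread(os.path.join(directory, sample)), INPUT_SHAPE)
            # Scale pixel values from [0, 255] to roughly [-1, 1].
            image_batch.append((image.astype(float) - 128) / 128)

        yield np.array(image_batch)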

Obviously you will have to modify it to read from the HDF5 instead.

Once you have written your function, the usage is simply:

data_dir = os.path.expanduser('~/my_data')  # os.listdir does not expand '~'
model.fit_generator(
    generate_data(data_dir, batch_size),
    steps_per_epoch=len(os.listdir(data_dir)) // batch_size)

Again, this would need to be modified to reflect the fact that you are reading from an HDF5 file and not a directory.

Jessica Alan


There is a designated function for this in `keras`. – Marcin Możejko Nov 01 '17 at 17:25
Yes, but OP was asking for an example of how to write a custom data generator for use cases not covered by that function. This answers that question. You are correct that they may be better off simply taking the images out of the HDF5 and using `flow_from_directory`. – Jessica Alan Nov 01 '17 at 17:26
No - he hasn't mention `flow_from_directory` not even once. He mention loading images from `h5` and then using `flow`. – Marcin Możejko Nov 01 '17 at 17:27
@Jeff Alan and any pointers on how could I include the image augmentation functionalities with the custom generator? – logi0517 Nov 01 '17 at 17:32
@MarcinMożejko if all else fails, I might try to use the flow_from_directory function, it was not my go to for 2 reasons: I assume directly reading in the arrays is faster and the food-101 data source has subdirectories for only the categories. So I would have to write extra code to split the 1000 images per category 3 ways. – logi0517 Nov 01 '17 at 17:35
If you write your own generator, you'll have to code them manually. I agree with Marcin that you are probably better off loading the images from directory using the native ImageDataGenerator. – Jessica Alan Nov 01 '17 at 17:35
@Jeff Alan I only saw examples with flow_from_directory on smaller datasets so far, if I use that function, will it be more "intelligent" than flow(), and only read in data to the memory 1 batch at a time? – logi0517 Nov 01 '17 at 17:37
Yes, that is the idea. `flow()` takes the entire dataset (as an array) as its parameters, `flow_from_directory()` takes a directory containing it. This allows you to leave the majority of the dataset on your hard drive while only loading the batch you are currently training on. – Jessica Alan Nov 01 '17 at 17:41
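
The "extra code" for splitting each category three ways, mentioned in the comments above, could look roughly like this sketch. The function name `split_three_ways` and the file names are hypothetical; the 70/20/10 fractions are the ones given in the question.

```python
import random


def split_three_ways(files, fractions=(0.7, 0.2, 0.1), seed=0):
    """Shuffle one category's files and split them into train/valid/test."""
    files = sorted(files)              # fixed starting order for reproducibility
    random.Random(seed).shuffle(files)
    n_train = round(len(files) * fractions[0])
    n_valid = round(len(files) * fractions[1])
    return {
        "train": files[:n_train],
        "valid": files[n_train:n_train + n_valid],
        "test": files[n_train + n_valid:],   # whatever remains
    }
```

Each resulting list would then be copied or symlinked into its own `train/<category>/`, `valid/<category>/` or `test/<category>/` directory, since `flow_from_directory` expects one subdirectory per class.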
