
I used a template of a custom image generator in Keras so that I can use hdf5 files as input. Initially the code was giving a "shape" error, so, following this post, I only included `from tensorflow.python.keras.utils.data_utils import Sequence`. Now I use it in this form, as you can also see in my colab notebook:

from numpy.random import uniform, randint
from tensorflow.python.keras.preprocessing.image import ImageDataGenerator
import numpy as np
from tensorflow.python.keras.utils.data_utils import Sequence

class CustomImagesGenerator(Sequence):
    def __init__(self, x, zoom_range, shear_range, rescale, horizontal_flip, batch_size):
        self.x = x
        self.zoom_range = zoom_range
        self.shear_range = shear_range
        self.rescale = rescale
        self.horizontal_flip = horizontal_flip
        self.batch_size = batch_size
        self.__img_gen = ImageDataGenerator()
        self.__batch_index = 0

    def __len__(self):
        # steps_per_epoch, if unspecified, will use the len(generator) as a number of steps.
        # hence this
        return np.floor(self.x.shape[0]/self.batch_size)

    # @property
    # def shape(self):
    #     return self.x.shape

    def next(self):
        return self.__next__()

    def __next__(self):
        start = self.__batch_index*self.batch_size
        stop = start + self.batch_size
        self.__batch_index += 1
        if stop > len(self.x):
            raise StopIteration
        transformed = np.array(self.x[start:stop])  # loads from hdf5
        for i in range(len(transformed)):
            zoom = uniform(self.zoom_range[0], self.zoom_range[1])
            transformations = {
                'zx': zoom,
                'zy': zoom,
                'shear': uniform(-self.shear_range, self.shear_range),
                'flip_horizontal': self.horizontal_flip and bool(randint(0,2))
            }
            transformed[i] = self.__img_gen.apply_transform(transformed[i], transformations)
        import pdb;pdb.set_trace()
        return transformed * self.rescale

And I call the generator with:

import h5py
import tables 

in_hdf5_file = tables.open_file("gdrive/My Drive/Colab Notebooks/dataset.hdf5", mode='r')
images = in_hdf5_file.root.train_img

my_gen = CustomImagesGenerator(
    images,
    zoom_range=[0.8, 1],
    batch_size=32,
    shear_range=6, 
    rescale=1./255, 
    horizontal_flip=False
)

classifier.fit_generator(my_gen, steps_per_epoch=100, epochs=1, verbose=1)

The import of Sequence resolved the "shape" error, but now I am getting the error:

Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/utils/data_utils.py", line 742, in _run
    sequence = list(range(len(self.sequence)))
TypeError: 'numpy.float64' object cannot be interpreted as an integer

How can I resolve this? I suspect it might again be a conflict in the Keras packages, but I don't know how to tackle it.

  • What is the `dtype` of your data? – Oleg Vorobiov Aug 29 '19 at 16:53
  • @okawo From the above `images`, if I do `print(images.dtype)` I get `dtype('uint8')`. Does this help in some way? Thanks – NeStack Aug 29 '19 at 17:53
  • Sorry for the delay, but try changing the `dtype` of your images to `np.float32` or `np.float64`. If your images are np.arrays you can do it with `arr.astype(dtype)` – Oleg Vorobiov Sep 01 '19 at 08:31
  • @okawo I changed the `dtype` to different types (float32, float64, int) when creating the hdf5 image file, but it didn't change the error message; every time it says there is a problem with 'numpy.float64'. So I think there is a conflict between some python packages. You can also have a look at my script that creates the hdf5 image files; I pasted it at the bottom of my colab notebook (link in the post) – NeStack Sep 01 '19 at 14:25
  • Try removing `classifier.add(InputLayer(input_shape=data_shape))` from your model; it appears that you have 2 layers with input in your sequential model – Oleg Vorobiov Sep 01 '19 at 14:45
  • @okawo Well spotted, but this doesn't change anything; the same error message is displayed. You could try to create the hdf5 on your own, you just need to point the script to a directory where you have jpg files, and then reproduce the error by running my custom generator. Thanks for providing answers! – NeStack Sep 01 '19 at 14:54
  • OK, so I have run some tests; it appears that your generator class is the problem. I can run your code and train your model if I use just `model.fit()` and not `model.fit_generator()`. Also, your generator class can't output training data, it can only output images, but to train your model you need `x=images, y=labels`. Plus, you already have your data saved and processed, so why not just use `fit()` with your data loaded as-is from your hdf5 file? – Oleg Vorobiov Sep 01 '19 at 15:55

1 Answer


Example usage with model.fit() in your case:

from tensorflow.keras.utils import to_categorical
import tensorflow as tf
import tables

#define your model

...

# load your data from an hdf5 file; [:] reads the whole array into memory
in_hdf5_file = tables.open_file("path/to/your/dataset.hdf5", mode='r')
x = in_hdf5_file.root.train_img[:]
y = in_hdf5_file.root.train_labels[:]

# 3 = number of classes in the labels
yourModel.fit(x, to_categorical(y, 3), epochs=2, batch_size=5)

For more info read my comments to your original post, or feel free to ask.

EDIT: I fixed your generator; now it only needs the path to your hdf5 file:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import *
from tensorflow.keras.utils import to_categorical

import numpy as np
from tensorflow.python.keras.utils.data_utils import Sequence
import tensorflow as tf

import tables

#define your model

...

# training
def h5data_generator(path, batch_size=1):
    batch_index = 0
    while 1:
        with tables.open_file(path, mode='r') as f:
            n_samples = f.root.train_img.shape[0]

            # wrap around once the whole dataset has been served
            if batch_index * batch_size >= n_samples:
                batch_index = 0

            start = batch_index * batch_size
            x = f.root.train_img[start:start + batch_size]
            y = f.root.train_labels[start:start + batch_size]

            batch_index += 1

            yield (x, to_categorical(y, 3))


my_gen = h5data_generator("path/to/your/dataset.hdf5")

yourModel.fit_generator(my_gen, steps_per_epoch=100, epochs=20, verbose=1)

The problem with your generator was the data it output on each step: it yielded only x (the images in your case), never an (x, y) tuple, so the model had no labels to train on. On top of that, the TypeError in your traceback comes from __len__: np.floor() returns a numpy.float64, while Keras runs list(range(len(self.sequence))), which needs a plain int. And since your class subclasses Sequence, Keras tries to drive it through the Sequence API (__len__ and __getitem__) instead of calling next() on it. It doesn't have to be a class at all; it can be a plain Python generator, as the example in Keras itself shows (the docstring of fit_generator()). If you'd rather keep a Sequence subclass, see the sketch after the docstring below.

fit_generator.__doc__:

Fits the model on data yielded batch-by-batch by a Python generator.

    The generator is run in parallel to the model, for efficiency.
    For instance, this allows you to do real-time data augmentation
    on images on CPU in parallel to training your model on GPU.

    The use of `keras.utils.Sequence` guarantees the ordering
    and guarantees the single use of every input per epoch when
    using `use_multiprocessing=True`.

    Arguments:
        generator: A generator or an instance of `Sequence`
          (`keras.utils.Sequence`)
            object in order to avoid duplicate data
            when using multiprocessing.
            The output of the generator must be either
            - a tuple `(inputs, targets)`
            - a tuple `(inputs, targets, sample_weights)`.
            This tuple (a single output of the generator) makes a single batch.
            Therefore, all arrays in this tuple must have the same length (equal
            to the size of this batch). Different batches may have different sizes.
            For example, the last batch of the epoch is commonly smaller than the
            others, if the size of the dataset is not divisible by the batch size.
            The generator is expected to loop over its data
            indefinitely. An epoch finishes when `steps_per_epoch`
            batches have been seen by the model.
        steps_per_epoch: Total number of steps (batches of samples)
            to yield from `generator` before declaring one epoch
            finished and starting the next epoch. It should typically
            be equal to the number of samples of your dataset
            divided by the batch size.
            Optional for `Sequence`: if unspecified, will use
            the `len(generator)` as a number of steps.
        epochs: Integer, total number of iterations on the data.
        verbose: Verbosity mode, 0, 1, or 2.
        callbacks: List of callbacks to be called during training.
        validation_data: This can be either
            - a generator for the validation data
            - a tuple (inputs, targets)
            - a tuple (inputs, targets, sample_weights).
        validation_steps: Only relevant if `validation_data`
            is a generator. Total number of steps (batches of samples)
            to yield from `generator` before stopping.
            Optional for `Sequence`: if unspecified, will use
            the `len(validation_data)` as a number of steps.
        validation_freq: Only relevant if validation data is provided. Integer
            or `collections.Container` instance (e.g. list, tuple, etc.). If an
            integer, specifies how many training epochs to run before a new
            validation run is performed, e.g. `validation_freq=2` runs
            validation every 2 epochs. If a Container, specifies the epochs on
            which to run validation, e.g. `validation_freq=[1, 2, 10]` runs
            validation at the end of the 1st, 2nd, and 10th epochs.
        class_weight: Dictionary mapping class indices to a weight
            for the class.
        max_queue_size: Integer. Maximum size for the generator queue.
            If unspecified, `max_queue_size` will default to 10.
        workers: Integer. Maximum number of processes to spin up
            when using process-based threading.
            If unspecified, `workers` will default to 1. If 0, will
            execute the generator on the main thread.
        use_multiprocessing: Boolean.
            If `True`, use process-based threading.
            If unspecified, `use_multiprocessing` will default to `False`.
            Note that because this implementation relies on multiprocessing,
            you should not pass non-picklable arguments to the generator
            as they can't be passed easily to children processes.
        shuffle: Boolean. Whether to shuffle the order of the batches at
            the beginning of each epoch. Only used with instances
            of `Sequence` (`keras.utils.Sequence`).
            Has no effect when `steps_per_epoch` is not `None`.
        initial_epoch: Epoch at which to start training
            (useful for resuming a previous training run)

    Returns:
        A `History` object.

    Example:

    ```python
        def generate_arrays_from_file(path):
            while 1:
                f = open(path)
                for line in f:
                    # create numpy arrays of input data
                    # and labels, from each line in the file
                    x1, x2, y = process_line(line)
                    yield ({'input_1': x1, 'input_2': x2}, {'output': y})
                f.close()

        model.fit_generator(generate_arrays_from_file('/my_file.txt'),
                            steps_per_epoch=10000, epochs=10)
    ```
    Raises:
        ValueError: In case the generator yields data in an invalid format.

For more info check out the GitHub page of Keras, fit_generator() to be exact, or again feel free to ask.
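If you do want to keep a Sequence subclass instead of a plain generator, here is a minimal sketch of the two fixes it needs: __len__ must return a plain Python int (np.floor() returns numpy.float64, which is exactly what raised your TypeError), and __getitem__ must return (x, y) batches. The class name H5Sequence is hypothetical; train_img, train_labels, and the 3 classes are carried over from the examples above:

from tensorflow.keras.utils import to_categorical
from tensorflow.python.keras.utils.data_utils import Sequence
import numpy as np
import tables

class H5Sequence(Sequence):
    def __init__(self, path, batch_size=32):
        self.path = path
        self.batch_size = batch_size
        # read the dataset length once up front
        with tables.open_file(path, mode='r') as f:
            self.n_samples = f.root.train_img.shape[0]

    def __len__(self):
        # must be a plain Python int, not numpy.float64
        return int(np.ceil(self.n_samples / self.batch_size))

    def __getitem__(self, index):
        # each call returns one (x, y) batch
        start = index * self.batch_size
        with tables.open_file(self.path, mode='r') as f:
            x = f.root.train_img[start:start + self.batch_size]
            y = f.root.train_labels[start:start + self.batch_size]
        return x, to_categorical(y, 3)  # 3 classes, as above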

EDIT 2: You can also pass batch_size to h5data_generator(), which sets how many samples are pulled from your dataset on each step.
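For example, a sketch of matching batch_size and steps_per_epoch (the 3200-image count is hypothetical):

# 3200 images at 32 per batch -> 3200 // 32 = 100 steps per epoch
my_gen = h5data_generator("path/to/your/dataset.hdf5", batch_size=32)

yourModel.fit_generator(my_gen, steps_per_epoch=100, epochs=20, verbose=1)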

  • Thanks, it looks like your suggestion works! I will test it out and return with more elaborate feedback – NeStack Sep 01 '19 at 23:23
  • OK good, I will also try to fix your generator, but it will take some time because life – Oleg Vorobiov Sep 02 '19 at 08:02
  • Thanks a lot! I made a few runs with your `h5data_generator` and indeed the epochs are much shorter. Though I notice the neural network doesn't really learn: the loss stays constant at ~10 and the accuracy doesn't improve. Did you observe this too, and do you have an explanation? – NeStack Sep 02 '19 at 19:46
  • In my tests I was using your model; however, I was also using random images that fit your model's input and random labels, and I only had 6 of them, but I changed the optimizer to Adam. During test runs the model went to `loss=0`, `acc=1` in a few epochs. But you can try to play with the learning rate and/or add more layers to your model. Oh, and if it's not a secret, what are you classifying? – Oleg Vorobiov Sep 02 '19 at 20:21
  • My optimizer during testing: `tf.train.AdamOptimizer(0.001)`, where `0.001` is the learning rate and `tf` is `tensorflow` – Oleg Vorobiov Sep 02 '19 at 20:23