Which Kedro dataset should be used when working with images and the Keras ImageDataGenerator? I know there is ImageDataset, but I have too many images to fit in memory. All that the Keras ImageDataGenerator really needs is the local folder location of the image dataset, laid out as follows:

data/
    train/
        dogs/
            dog001.jpg
            dog002.jpg
            ...
        cats/
            cat001.jpg
            cat002.jpg
            ...
    validation/
        dogs/
            dog001.jpg
            dog002.jpg
            ...
        cats/
            cat001.jpg
            cat002.jpg
            ...

It would be possible to use a parameter specifying the data location, but I think the appropriate place for data locations is the Data Catalog. Is there a simple way to specify this data location in the Data Catalog?

evolved

2 Answers

1

How about setting the path in parameters.yml and then passing it as an input to your ImageDataGenerator? It could look something like:

train_dogs_location: data/train/dogs/

Adapt the above example to whatever suits your project best. You can also consider setting a global path for all datasets, for example your root data folder, in the conf/base/globals.yml file.
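As a rough illustration of the parameter-based approach, here is a minimal sketch. The `data_root` key and both function names are hypothetical, not part of Kedro or Keras. It also assumes you point `flow_from_directory` at the split folder (e.g. data/train), since Keras infers the classes from the subfolders.

```python
from pathlib import Path
from typing import Dict


def resolve_split_dirs(parameters: Dict[str, str]) -> Dict[str, Path]:
    """Build one directory per split from a root folder given in the parameters."""
    root = Path(parameters["data_root"])  # hypothetical key in parameters.yml
    return {split: root / split for split in ("train", "validation")}


def make_iterator(directory: Path, **flow_args):
    """Create a Keras directory iterator; TensorFlow is imported only when called."""
    import tensorflow as tf

    generator = tf.keras.preprocessing.image.ImageDataGenerator()
    return generator.flow_from_directory(str(directory), **flow_args)


split_dirs = resolve_split_dirs({"data_root": "data"})
```

A node could then call `make_iterator(split_dirs["train"])` to get the training iterator without loading all images into memory.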

  • Thanks @Shubham Agrawal for your answer. That is what I am currently doing but I'd like to know if there is a way to specify the data location inside the Data Catalog (where it belongs in my opinion)? – evolved Oct 05 '20 at 17:25
  • I heavily doubt any of the Kedro datasets have such a property. You might be better off using a parameter file. Storing the location and dataset in some form of dictionary and then PickleDataSet could be another option if you have to use a DataSet. – Shubham Agrawal Oct 05 '20 at 18:01
  • Thank you, I think I'll go/stay with the parameter file option because having the dataset organized in the Data Catalog is not mandatory. However, I think kedro is missing some sort of tf.data.Dataset type that can be used in the Data Catalog. – evolved Oct 05 '20 at 18:29
  • I might agree with you. I think it might be worthwhile for you to open a ticket/issue on their repo and explain why this might be a beneficial feature. Might gather more support/insights from the community. – Shubham Agrawal Oct 05 '20 at 21:57
1

There are two parts of your question which I think are important to separate:

  1. Is it possible to configure a custom ImageDataGenerator dataset? (TL;DR: yes)
  2. Is it possible to configure the above with file path parameters that match my use case? (TL;DR: yes, but you probably don't want to hard-code your own directory structure as the default, since other users with a different layout might not be able to use it.)

1: Is it possible to configure a custom ImageDataGenerator dataset?

Here's a rough sketch of Python code that you could use to build out a custom dataset. I'll leave it to you to get it into working shape if you want a solution like this. For inspiration, look at the sample datasets in the Kedro GitHub repo and the tutorial on creating custom datasets in the Kedro documentation on Read the Docs.

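from typing import Any, Dict, Optional

import tensorflow as tf
from kedro.io.core import AbstractDataSet, DataSetError


class ImageDataGeneratorDataSet(AbstractDataSet):
    """Read-only dataset that yields image batches from a directory."""

    def __init__(
        self,
        filepath: str,
        load_args: Optional[Dict[str, Any]] = None,
        save_args: Optional[Dict[str, Any]] = None,
    ):
        self._filepath = filepath
        # Default to empty dicts so that ** unpacking works when no args are given.
        self._load_args = load_args or {}
        self._save_args = save_args or {}

    # AbstractDataSet expects subclasses to implement _load, _save and _describe.
    def _load(self):
        # load_args are forwarded to the ImageDataGenerator constructor.
        generator = tf.keras.preprocessing.image.ImageDataGenerator(**self._load_args)
        return generator.flow_from_directory(self._filepath)

    def _save(self, data) -> None:
        raise DataSetError("Saving with the ImageDataGeneratorDataSet is not supported")

    def _describe(self) -> Dict[str, Any]:
        return dict(filepath=self._filepath, load_args=self._load_args)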
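To tie this back to the Data Catalog, here is a minimal sketch of how such a dataset might be registered, assuming the class lives at the hypothetical module path `my_project.io` and that the `rescale` load argument is what you want. The dictionary mirrors what a catalog.yml entry would contain.

```python
from typing import Any, Dict

# Hypothetical module path; point it at wherever the custom class actually lives.
CATALOG_CONFIG: Dict[str, Dict[str, Any]] = {
    split: {
        "type": "my_project.io.ImageDataGeneratorDataSet",
        "filepath": f"data/{split}",
        "load_args": {"rescale": 1.0 / 255},
    }
    for split in ("train", "validation")
}


def build_catalog():
    """Turn the config into a DataCatalog; Kedro is imported only when called."""
    from kedro.io import DataCatalog

    return DataCatalog.from_config(CATALOG_CONFIG)
```

With entries like these, a node can take `train` and `validation` as inputs and receive the directory iterators, so the data location stays in the catalog rather than in parameters.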

2: Is it possible to configure the above with file path parameters that match my use case?

While we could modify the above to take in some parameters and return different iterators, this might cause problems if the directory structure is different, because parameterisation largely relies on common conventions.

If your convention is data/{train/validation}/{dog/cat}/images..., your solution for extracting and applying parameters is likely to be coupled to the order of train/validation and dog/cat, and would likely not work for a different user whose convention is data/{dog/cat}/{train/validation}/images....
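The coupling can be made concrete with a small sketch. The helper `parse_split_and_class` is hypothetical and assumes a purely positional layout of data/{split}/{class}/file.

```python
from pathlib import PurePosixPath
from typing import Tuple


def parse_split_and_class(path: str) -> Tuple[str, str]:
    """Read split and class by position, assuming data/{split}/{class}/file."""
    _, split, label, _ = PurePosixPath(path).parts
    return split, label


# Works for the layout the helper was written for.
mine = parse_split_and_class("data/train/dogs/dog001.jpg")    # ("train", "dogs")

# Silently swaps the two values for the other convention.
theirs = parse_split_and_class("data/dogs/train/dog001.jpg")  # ("dogs", "train")
```

Nothing fails in the second call, which is what makes a convention baked into a shared dataset class risky.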

A better pattern might be to implement a dataset like the one outlined in the first section, register a dataset in the catalog for each of your training and validation sets, and combine the iterators at runtime within your nodes to create the train and validation iterators.

For example, you would have the datasets train_cats, train_dogs, validation_cats and validation_dogs. Within the node you could zip these iterators together (itertools.izip in Python 2; see https://stackoverflow.com/a/243902/13341083).
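Here is a minimal sketch of combining two iterators inside a node, using plain Python iterators as stand-ins for the dataset outputs. The function name and the `steps` argument are assumptions. Keras iterators typically loop indefinitely, so the sketch bounds the number of steps instead of exhausting the zip.

```python
from itertools import cycle, islice
from typing import Iterable, Iterator, Tuple


def pair_batches(cats: Iterable, dogs: Iterable, steps: int) -> Iterator[Tuple]:
    """Yield one batch from each iterator per step, for a fixed number of steps."""
    return islice(zip(cats, dogs), steps)


# Endless stand-ins for the iterators loaded from train_cats and train_dogs.
train_cats = cycle(["cat_batch_0", "cat_batch_1"])
train_dogs = cycle(["dog_batch_0"])

paired = list(pair_batches(train_cats, train_dogs, steps=3))
```

Each element of `paired` holds one cat batch and one dog batch, which the node would still need to merge and label before training.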

If you end up pursuing this approach, please raise a PR and contribute :) Best of luck

Robert Christopher
  • Thanks @William Ashford. I thought about writing a custom DataSet too. But the main concern I have is that [`tf.keras.preprocessing.image.ImageDataGenerator`](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator#flow_from_directory) returns a `DirectoryIterator` and I don't know if this can be pickled, which is required by kedro I think, to pass the output between different nodes. – evolved Oct 06 '20 at 07:51
  • I know pickling/deep-copying doesn't directly work with `tf.data.Dataset`, which is what I finally want, using the keras ImageDataGenerator. I've asked about this in [another question](https://stackoverflow.com/q/63730066/2137370). Do you know if the `copy_mode` = `assign` of `MemoryDataset` does no deep copy? – evolved Oct 06 '20 at 07:52
  • Anyway the `ImageDataGeneratorDataset` makes sense, if it works, and I'll come up with a PR. – evolved Oct 06 '20 at 07:52
  • Regarding the second part: I agree only partly with you because as far as I know, the keras `ImageDataGenerator` expects the following folder hierarchy convention: `data/{class_0, class_1, ... class_n}`. Where data can be either train or validation. Please see the `classes` and `subsets` argument of [`flow_from_directory`](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator#flow_from_directory) for more details. – evolved Oct 06 '20 at 08:25
  • I have added a [PR](https://github.com/quantumblacklabs/kedro/pull/549) but it needs some discussion. – evolved Oct 06 '20 at 17:06
  • Yeah makes sense with the Keras convention. I wasn't aware of this. – William Ashford Oct 07 '20 at 08:09
  • Regarding the memory dataset, this would only come into play if you didn't have a catalog entry. As you mention though, you can manually configure a MemoryDataSet to avoid copy or deep copy by using assign. – William Ashford Oct 07 '20 at 08:12
  • Does my answer answer your question? – William Ashford Oct 07 '20 at 08:15