
I am using the tfds.load function from TensorFlow Datasets to load my data:

import tensorflow_datasets as tfds
import tensorflow as tf

(raw_train, raw_validation, raw_test), metadata = tfds.load(
    'cats_vs_dogs',
    split=['train[:80%]', 'train[80%:90%]', 'train[90%:]'],
    with_info=True,
    as_supervised=True,
)

Now, I have some additional images of cats and dogs on my local PC (for example Cat1.jpg). I would like to add them to this data. How can I do this?

Note that I have not just one file but a lot of them. Furthermore, this is just a binary classification example; the same question holds for multi-class classification, so it would be good to have a solution for that as well.

Update: I tried several approaches, such as reading the images with tf.keras.preprocessing.image_dataset_from_directory from tf-nightly, but unfortunately it is not that easy. There are a lot of problems; for example, the resulting dataset has a different dtype and cannot be merged with the original one. I have no solution for this problem. I put a bounty on it because I really need detailed, working code and not just general thoughts on how this could be achieved in theory. The solution does not have to use image_dataset_from_directory; any detailed code that works is fine.

I did not want to post any code, as I think there are better ways to solve this. However, here is what I tried (in Colab):

!pip install tf-nightly
#!pip uninstall tf-nightly

import tensorflow as tf
print(tf.__version__)

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    '/tmp/Test/',
    image_size = (224,224),
    batch_size = 32,
    # label_mode = 'int'
)

There is a Test folder in /tmp with two subfolders, cat and dog. Each contains some random pictures from an image search for cats and dogs.

The resulting train_ds is a <BatchDataset shapes: ((None, 224, 224, 3), (None,)), types: (tf.float32, tf.int32)>

import os
import shutil

os.listdir("/tmp/Test") #First find where the ".ipynb_checkpoints" is located.

shutil.rmtree("/tmp/Test/.ipynb_checkpoints")

import tensorflow_datasets as tfds
(raw_train, raw_validation, raw_test), metadata = tfds.load(
    'cats_vs_dogs',
    split=['train[:80%]', 'train[80%:90%]', 'train[90%:]'],
    with_info=True,
    as_supervised=True,
)

raw_train for example is a <DatasetV1Adapter shapes: ((None, None, 3), ()), types: (tf.uint8, tf.int64)>.

def _normalize_img(img, label):
    img = tf.cast(img, tf.float32) / 255.
    img = tf.image.resize(img, (224, 224))
    label = tf.cast(label, tf.int64)
    # Note: the values are in [0, 1] here, so casting back to uint8
    # truncates almost every pixel to 0.
    img = tf.cast(img, tf.uint8)
    return (img, label)

# ds = tfds.load('mnist', split='train', as_supervised=True)
ds = raw_train.map(_normalize_img)

ds is now a <DatasetV1Adapter shapes: ((224, 224, 3), ()), types: (tf.uint8, tf.int64)>

test=ds.concatenate(raw_train)

This does not solve it, as the data is not properly matched/concatenated. Furthermore, in the multi-class case I have no way to check that the labels match.

So I do not need general thoughts about how this could be achieved in theory; I need a detailed, working solution with code. And not just for the binary case shown here, but also for multi-class problems, where I have the same issue: how do I match the labels read from disk with the labels from tfds.load, so that no classes get mismatched or mixed up? For example, cats must not become horses (in a cats vs. dogs vs. horses setup).
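A minimal sketch of one way to keep the labels aligned, assuming each local subfolder is named exactly like one of the class names that TFDS reports (for example the list in metadata.features['label'].names). The helper name index_local_images and the list of accepted file suffixes are hypothetical, chosen only for illustration:

```python
from pathlib import Path

# Hypothetical set of file suffixes treated as images.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def index_local_images(root, class_names):
    """Return (paths, labels) for images stored as root/<class name>/<file>.

    Each label is the position of the folder name in class_names, so the
    local labels use the same integer ids as the dataset the names came from.
    """
    name_to_index = {name: i for i, name in enumerate(class_names)}
    paths, labels = [], []
    for class_dir in sorted(Path(root).iterdir()):
        # Skip stray files and hidden folders such as .ipynb_checkpoints.
        if not class_dir.is_dir() or class_dir.name.startswith("."):
            continue
        if class_dir.name not in name_to_index:
            raise ValueError(
                f"Folder {class_dir.name!r} is not one of {list(class_names)}"
            )
        for image_path in sorted(class_dir.iterdir()):
            if image_path.suffix.lower() in IMAGE_SUFFIXES:
                paths.append(str(image_path))
                labels.append(name_to_index[class_dir.name])
    return paths, labels
```

Because each label is looked up by folder name rather than by folder order, a cat folder maps to the cat index no matter how many classes exist, and an unexpected folder raises an error instead of silently shifting the labels.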

Second way: I also tried to point an ImageDataGenerator directly at the raw_train dataset. If that had worked, I could have proceeded with ImageDataGenerator in general, although I actually did not want this; I just want to add images to the raw_train dataset. I tried this:

from tensorflow.keras.preprocessing.image import ImageDataGenerator
train_image_generator = ImageDataGenerator(
    rescale=1./255,
)

train_datagen = train_image_generator.flow_from_directory(
  directory=raw_train,
  target_size=(224, 224),
  shuffle=True,
  batch_size=128,
  class_mode='binary'
)

The idea was then to match/concatenate the results of these data generators. But it is not possible to point flow_from_directory at raw_train; it raises an error.

Stat Tistician

1 Answer


The objects returned by tfds.load are instances of tf.data.Dataset. Therefore, you can build a new tf.data.Dataset from your local images and then use the concatenate method to join the two. There are at least three ways to build such a dataset from images on disk:

  • You can use the newly added tf.keras.preprocessing.image_dataset_from_directory function. For the moment, this is only available in tf-nightly. You can find an example of working with this function here.

  • Alternatively, you can use the tf.data API to get much more control over the loading process as well as over further transformations of the images and their labels. Here is an example of how to achieve this.

  • Or you can first load the images into a NumPy array using whatever library or method you like, and construct a second array holding their labels. Then you can create a tf.data.Dataset instance using the from_tensor_slices method. You can find an example here. Note that this method is NOT recommended if you have lots of images: the NumPy array becomes very large, which makes the data pipeline memory-wasteful or impossible to build.
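As a rough sketch of the second approach, assuming the local files are JPEGs and that their paths and integer labels have already been collected (with the labels using the same class ids as the tfds dataset), both datasets can be brought to the same (uint8 image, int64 label) layout before they are concatenated. The helper names load_local_image, resize_image and merge_with_local are hypothetical:

```python
import tensorflow as tf

IMAGE_SIZE = (224, 224)  # assumed target size, as used in the question


def load_local_image(path, label):
    """Decode one JPEG into the (uint8 image, int64 label) layout of the tfds split."""
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)  # uint8, HxWx3
    return image, tf.cast(label, tf.int64)


def resize_image(image, label):
    """Resize to a fixed shape; tf.image.resize returns float32 values in [0, 255]."""
    return tf.image.resize(image, IMAGE_SIZE), label


def merge_with_local(tfds_split, paths, labels):
    """Append the local images to a tfds split, then resize every element."""
    local = tf.data.Dataset.from_tensor_slices((paths, labels)).map(load_local_image)
    # Both datasets now yield (uint8 [H, W, 3], int64 scalar), so they can be joined.
    return tfds_split.concatenate(local).map(resize_image)
```

With the data from the question this would be called as merge_with_local(raw_train, paths, labels). Resizing only after the concatenation keeps a single preprocessing path for both sources, so the original and the added images cannot end up with different dtypes or sizes.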

today
  • I put a bounty on it, because I need a working solution. I tried the way you described here, however there are a lot of problems, like e.g. the dataset from image_dataset_from_directory is in a completely different dtype and therefore cannot be concatenated with the original dataset. – Stat Tistician Jul 19 '20 at 09:38
  • @StatTistician For each approach I mentioned, I have given a link as a guide to help you understand how it works and let you implement it for yourself. That's the responsibility of you to read them and adapt them for your use case. And if you encountered any error or obstacle you need to ask for help with detailed info in a **separate question**. Let me remind you that this site is not a homework/project solver for you and for your specific use cases. There could be tens of variations on this problem (dtype of arrays, format of images, normalizing, format of labels, etc.). >>>> – today Jul 19 '20 at 12:40
  • @StatTistician >>>> And therefore that's your responsibility to adapt these solutions to your specific use case and ask for help if any error occurs (of course, in a separate question and providing enough details). We aren't here to cover tens of variations for you or anybody else so that you could easily solve your homework/project/job task. And finally, putting a bounty does not justify the expectation of "I need a complete working solution for my specific use case"; especially when there are guides/docs out there which cover this (and have been linked to in my answer). – today Jul 19 '20 at 12:42