
I'm new to TensorFlow/Keras, and I have a file structure with 3,000 folders, each containing 200 images, to be loaded in as data. I know that keras.preprocessing.image_dataset_from_directory lets me load the data and split it into training/validation sets, as below:

val_data = tf.keras.preprocessing.image_dataset_from_directory(
    'etlcdb/ETL9G_IMG/',
    image_size=(128, 127),
    validation_split=0.3,
    subset="validation",
    seed=1,
    color_mode='grayscale',
    shuffle=True)

Found 607200 files belonging to 3036 classes. Using 182160 files for validation.

But then I'm not sure how to further split my validation set into validation and test splits while maintaining proper class balance. From what I can tell (through the GitHub source code), the take method simply takes the first x elements of the dataset, and skip skips over them. I am unsure whether this maintains stratification of the data, and I'm not quite sure how to return labels from the dataset to check.

Any help would be appreciated.

lolwatpie
3 Answers


You almost got the answer. The key is to use .take() and .skip() to further split the validation set into two datasets: one for validation and the other for test. Using your example, you need to execute the following lines of code. Let's assume that you need 70% for the training set, 10% for the validation set, and 20% for the test set. For the sake of completeness, I am also including the step that generates the training set. Let's also assign a few basic variables that must be the same when first splitting the entire dataset into training and validation sets.

seed_train_validation = 1  # Must be the same for train_ds and val_ds
shuffle_value = True
validation_split = 0.3

train_ds = tf.keras.utils.image_dataset_from_directory(
    directory='etlcdb/ETL9G_IMG/',
    image_size=(128, 127),
    validation_split=validation_split,
    subset="training",
    seed=seed_train_validation,
    color_mode='grayscale',
    shuffle=shuffle_value)

val_ds = tf.keras.utils.image_dataset_from_directory(
    directory='etlcdb/ETL9G_IMG/',
    image_size=(128, 127),
    validation_split=validation_split,
    subset="validation",
    seed=seed_train_validation,
    color_mode='grayscale',
    shuffle=shuffle_value)

Next, determine how many batches of data are available in the validation set using tf.data.experimental.cardinality, and then move two-thirds of them (2/3 of 30% = 20%) to a test set as follows. Note that the default value of batch_size is 32 (see the documentation).

val_batches = tf.data.experimental.cardinality(val_ds)  # number of batches in val_ds
test_ds = val_ds.take((2 * val_batches) // 3)  # first two-thirds of the batches -> test set
val_ds = val_ds.skip((2 * val_batches) // 3)   # remaining one-third -> validation set

All three datasets (train_ds, val_ds, and test_ds) yield batches of images together with labels inferred from the directory structure, so you are good to go from here.
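
Note that take() and skip() operate on whole batches in order, so this split is not explicitly stratified per class. If you want to verify how the classes ended up distributed, here is a minimal sketch, assuming the test_ds from above and the default label_mode='int':

import numpy as np

# Collect the integer labels from every batch in the test set.
test_labels = np.concatenate([labels.numpy() for _, labels in test_ds])

# Count how many samples landed in each class.
unique_classes, counts = np.unique(test_labels, return_counts=True)
print(f"{len(unique_classes)} classes in the test set")
print(dict(zip(unique_classes.tolist(), counts.tolist())))

The same check works for train_ds and val_ds.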

Sonjoy Das

I could not find supporting documentation, but I believe image_dataset_from_directory takes the end portion of the dataset as the validation split. shuffle is now set to True by default, so the dataset is shuffled before training, to avoid using only some classes for the validation split. The split done by image_dataset_from_directory only relates to the training process. If you need a (highly recommended) test split, you should split your data beforehand into training and testing directories. Then, image_dataset_from_directory will split your training data into training and validation.

I usually take a smaller percentage (10%) for the in-training validation and split the original dataset 80% training / 20% testing. With these values, the final splits (relative to the initial dataset size) are:

  • 80% training:
    • 72% training (used to adjust the weights in the network)
    • 8% in-training validation (used only to check the metrics of the model after each epoch)
  • 20% testing (never seen by the training process at all)

There is additional information on how to split the data in your directories in this question: Keras split train test set when using ImageDataGenerator
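
If it helps, one way to do the beforehand split is to copy a fixed fraction of each class folder into a separate test directory, which keeps the split stratified by class. Below is a minimal sketch; the paths, the 20% fraction, and the directory layout (one subfolder per class) are assumptions to adapt to your own data:

import random
import shutil
from pathlib import Path

source_dir = Path("dataset")              # hypothetical: one subfolder per class
train_dir = Path("dataset_split/train")   # hypothetical output directories
test_dir = Path("dataset_split/test")
test_fraction = 0.2                       # 20% of each class goes to the test set

random.seed(1)
for class_dir in source_dir.iterdir():
    if not class_dir.is_dir():
        continue
    files = sorted(class_dir.glob("*"))
    random.shuffle(files)
    n_test = int(len(files) * test_fraction)
    for i, f in enumerate(files):
        dest = test_dir if i < n_test else train_dir
        target = dest / class_dir.name
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(f, target / f.name)

After this, you can point image_dataset_from_directory at dataset_split/train with a validation_split for the in-training validation, and at dataset_split/test (with no split) for the final evaluation.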

Carlos Melus

For splitting into train and validation, maybe you can do something like this.

The main point is to keep the same seed.

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    directory,
    label_mode='categorical',
    validation_split=0.2,
    subset="training",
    seed=1337,
    color_mode="grayscale",
    image_size=image_size,
    batch_size=batch_size,
)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    directory,
    validation_split=0.2,
    subset="validation",
    label_mode='categorical',
    seed=1337,
    color_mode="grayscale",
    image_size=image_size,
    batch_size=batch_size,
)

The code is taken from: https://keras.io/examples/vision/image_classification_from_scratch/

Michael D
  • How would you make the testing set with image_dataset_from_directory()? I currently have the issue of not having a test set, because my dataset is too large to load with anything other than this function. – wetmoney Dec 08 '21 at 01:36
  • @wetmoney, if you mean splitting into train, validation, and test, it's not straightforward from here. Maybe I would try to create a separate folder for the test set and read files from there. If someone has a better suggestion, please let me know. – Michael D Dec 09 '21 at 09:40
  • Wish this function allowed "testing" as a parameter for subset. – wetmoney Dec 09 '21 at 17:40
  • From his question, it's clear wetmoney was already aware of this functionality. – Mike Johnson Jr Mar 24 '22 at 08:04