How to split images into test and train set using my own data in TensorFlow

Question

I am a little confused here... I just spent the last hour reading about how to split my dataset into test/train in TensorFlow. I was following this tutorial to import my images: https://www.tensorflow.org/tutorials/load_data/images. Apparently one can split into train/test with sklearn's model_selection.train_test_split.

But my question is: when do I split my dataset into train/test? I have already loaded my dataset as in the tutorial (see below), now what? How do I split it? Do I have to do it before loading the files as a tf.data.Dataset?

# determine names of classes
CLASS_NAMES = np.array([item.name for item in data_dir.glob('*') if item.name != "LICENSE.txt"])
print(CLASS_NAMES)

# count images
image_count = len(list(data_dir.glob('*/*.png')))
print(image_count)


# load the files as a tf.data.Dataset
list_ds = tf.data.Dataset.list_files(str(cwd + '/train/' + '*/*'))

Also, my data structure looks like the following. No test folder, no val folder. I would need to take 20% for test from that train set.

train
 |__ class 1
 |__ class 2
 |__ class 3

Vladimir Sotnikov · Accepted Answer · 2020-02-09T20:47:47.360

5

You can use tf.keras.preprocessing.image.ImageDataGenerator:

image_generator = tf.keras.preprocessing.image.ImageDataGenerator(validation_split=0.2)
train_data_gen = image_generator.flow_from_directory(directory='train',
                                                     subset='training')
val_data_gen = image_generator.flow_from_directory(directory='train',
                                                   subset='validation')

Note that you'll probably need to set other data-related parameters of flow_from_directory for your data, such as target_size, batch_size and class_mode.

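UPDATE: If your data is already a tf.data.Dataset, you can obtain two slices of it via take() and skip(), where val_data_size is the number of elements to reserve for validation. Make sure the order of the dataset is fixed across iterations (i.e. it is not reshuffled on every pass), otherwise the two slices can overlap: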

val_data = data.take(val_data_size)
train_data = data.skip(val_data_size)
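
For illustration, here is a minimal sketch of the take()/skip() split, assuming the dataset is finite and its size is known. The helper name split_dataset, the toy dataset built with tf.data.Dataset.range and the seed value are assumptions made for this example, not something from the tutorial:

```python
import tensorflow as tf


def split_dataset(dataset, num_elements, val_fraction=0.2, seed=42):
    """Split a finite tf.data.Dataset into non-overlapping (train, val) parts."""
    val_size = int(num_elements * val_fraction)
    # Shuffle once with a fixed order. reshuffle_each_iteration=False keeps
    # that order identical on every pass, so take() and skip() stay disjoint.
    shuffled = dataset.shuffle(
        num_elements, seed=seed, reshuffle_each_iteration=False
    )
    val_data = shuffled.take(val_size)
    train_data = shuffled.skip(val_size)
    return train_data, val_data


# Toy stand-in for a dataset of 10 images (hypothetical data).
data = tf.data.Dataset.range(10)
train_data, val_data = split_dataset(data, num_elements=10)

train_items = sorted(int(x) for x in train_data.as_numpy_iterator())
val_items = sorted(int(x) for x in val_data.as_numpy_iterator())
print("train:", train_items)
print("val:", val_items)
```

The split should happen before any batch() or repeat() call, since take() and skip() count whatever elements the dataset yields at that point (batches rather than single images, once batched).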

edited Feb 09 '20 at 20:47

answered Feb 08 '20 at 20:44


Got it! Thanks. But what if I used `tf.data` to load my images and then `Dataset.map` to create a dataset of (image, label) pairs? I now have all my images in `train_ds = prepare_for_training(labeled_ds)`. How would you split it then? I'm following this tutorial: https://www.tensorflow.org/tutorials/load_data/images – Guillermina Feb 08 '20 at 22:18

score 0 · Answer 2 · answered Sep 27 '20 at 02:55

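If all your data is in a single folder (with one subfolder per class) and you want to split it into training and validation sets using tf.data, do the following:

# shuffle=False stops list_files from reshuffling the files on every iteration
list_ds = tf.data.Dataset.list_files(str(cwd + '/train/' + '*/*'), shuffle=False)
image_count = len(list(data_dir.glob('*/*.png')))
# shuffle once in a fixed order so that skip() and take() never overlap
list_ds = list_ds.shuffle(image_count, reshuffle_each_iteration=False)

val_size = int(image_count * 0.2)
train_set = list_ds.skip(val_size)
val_set = list_ds.take(val_size)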
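
On the question of whether to split before loading the files as a tf.data.Dataset: one possible alternative is to split the file paths themselves, class by class, so that each class contributes about 20% of its images to the test set. The sketch below uses only the standard library. The function name split_paths_by_class and its parameters are hypothetical, and it assumes the train/<class>/<image> layout shown in the question:

```python
import random
from pathlib import Path


def split_paths_by_class(root, test_fraction=0.2, seed=0):
    """Return (train_paths, test_paths), holding out a fraction of every class."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_paths, test_paths = [], []
    # Every subdirectory of root is treated as one class.
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        files = sorted(str(f) for f in class_dir.iterdir() if f.is_file())
        rng.shuffle(files)
        n_test = int(len(files) * test_fraction)
        test_paths.extend(files[:n_test])
        train_paths.extend(files[n_test:])
    return train_paths, test_paths
```

Each of the two lists could then be turned into its own dataset, for example with tf.data.Dataset.from_tensor_slices(train_paths), and mapped through the same loading function as in the tutorial.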
