12

I am interested in using ImageDataGenerator in Keras for data augmentation. But it requires that the training and validation directories, each with subdirectories for the classes, be fed in separately, as below (this is from the Keras documentation). I have a single directory with 2 subdirectories for 2 classes (Data/Class1 and Data/Class2). How do I randomly split this into training and validation directories?

    train_datagen = ImageDataGenerator(
        rescale=1./255,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True)

    test_datagen = ImageDataGenerator(rescale=1./255)

    train_generator = train_datagen.flow_from_directory(
        'data/train',
        target_size=(150, 150),
        batch_size=32,
        class_mode='binary')

    validation_generator = test_datagen.flow_from_directory(
        'data/validation',
        target_size=(150, 150),
        batch_size=32,
        class_mode='binary')

    model.fit_generator(
        train_generator,
        steps_per_epoch=2000,
        epochs=50,
        validation_data=validation_generator,
        validation_steps=800)

I am interested in re-running my algorithm multiple times with random training and validation data splits.

Marcin Możejko

7 Answers

22

Thank you guys! I was able to write my own function to create training and test data sets. Here's the code for anyone who's looking.

import os
import shutil
import numpy as np

source1 = "/source_dir"   # class sub-directory holding all of the images
dest11 = "/dest_dir"      # corresponding validation sub-directory

# Move roughly 20% of the files, chosen at random, into the validation directory.
# Run this once per class sub-directory so the split preserves the class structure.
for f in os.listdir(source1):
    if np.random.rand() < 0.2:
        shutil.move(os.path.join(source1, f), os.path.join(dest11, f))
7

https://stackoverflow.com/a/52372042/10111155 provided the easiest way: ImageDataGenerator now supports a train/validation split directly from a single directory with class subdirectories, via its validation_split argument.

This is copied directly from that answer with no changes. I take no credit. I tried it and it worked perfectly.

Note that train_data_dir is the same in the train_generator and validation_generator. If you want a three-way split (train/test/valid) using ImageDataGenerator, the source code will need to be modified --- there are nice instructions here.

train_datagen = ImageDataGenerator(rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    validation_split=0.2) # set validation split

train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    class_mode='binary',
    subset='training') # set as training data

validation_generator = train_datagen.flow_from_directory(
    train_data_dir, # same directory as training data
    target_size=(img_width, img_height),
    batch_size=batch_size,
    class_mode='binary',
    subset='validation') # set as validation data

model.fit_generator(
    train_generator,
    steps_per_epoch = train_generator.samples // batch_size,
    validation_data = validation_generator, 
    validation_steps = validation_generator.samples // batch_size,
    epochs = nb_epochs)
Beau Hilton
  • I wonder if we can do a three way split by creating another instance of ImageDataGenerator without the validation_split argument? – Yigit Alparslan Apr 19 '20 at 10:39
  • You need a comma after 'binary' in the validation generator. – Imran Feb 21 '21 at 06:15
  • I would like to point out that this way you will also apply augmentations to the validation dataset, which we generally would not want to do. – ARAT Mar 12 '21 at 09:52
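
As the last comment notes, this setup applies the training augmentations to the validation images as well. A commonly used workaround is sketched below; it relies on the assumption that validation_split partitions the sorted file list deterministically, so two generators configured with the same split fraction see the same training/validation partition, letting the validation subset skip augmentation:

    train_datagen = ImageDataGenerator(
        rescale=1./255,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True,
        validation_split=0.2)

    # Second generator: only rescaling, no augmentation, same split fraction.
    val_datagen = ImageDataGenerator(
        rescale=1./255,
        validation_split=0.2)

    train_generator = train_datagen.flow_from_directory(
        train_data_dir,
        target_size=(img_width, img_height),
        batch_size=batch_size,
        class_mode='binary',
        subset='training')

    validation_generator = val_datagen.flow_from_directory(
        train_data_dir,  # same directory as the training data
        target_size=(img_width, img_height),
        batch_size=batch_size,
        class_mode='binary',
        subset='validation')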
3

If you only want to split the image data without applying any transformations to the images, use the following code.

from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
        validation_split=0.4)

train_generator = train_datagen.flow_from_directory(
        'path_to_data_directory',
        subset='training')

validation_generator = train_datagen.flow_from_directory(
        'path_to_data_directory', #same as in train generator
        subset='validation')

This takes the images from the sub-folders of 'path_to_data_directory' and assigns each sub-folder's name as the class name of the images it contains.

Sample output

Found 43771 images belonging to 9385 classes.
Found 22490 images belonging to 9385 classes.

You can then pass these generators to model.fit_generator to train your model.
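
For example, a minimal sketch, assuming model is an already-compiled Keras model and using the two generators defined above (the epoch count is an arbitrary placeholder):

    model.fit_generator(
        train_generator,
        steps_per_epoch=train_generator.samples // train_generator.batch_size,
        validation_data=validation_generator,
        validation_steps=validation_generator.samples // validation_generator.batch_size,
        epochs=10)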

Refer to https://keras.io/preprocessing/image/ for details.

codeslord
2

Unfortunately, the current implementation of keras.preprocessing.image.ImageDataGenerator (as of October 14th, 2017) does not support this, but as it's a frequently requested feature I expect it to be added in the near future.

But you could do this using standard Python os operations. Depending on the size of your dataset, you could also try loading all the images into RAM first and then using the classical fit method, whose validation_split argument can hold out a validation fraction (shuffle the arrays first if you want the split to be random, since it simply takes the last fraction of the data).
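
A minimal sketch of that load-into-RAM approach, assuming the Data/Class1 and Data/Class2 layout from the question and an already-compiled binary classifier named model (image size, batch size, epochs and split fraction are placeholders):

    import os
    import numpy as np
    from keras.preprocessing.image import load_img, img_to_array

    data_dir = 'Data'  # root folder containing the Class1/ and Class2/ sub-directories
    class_names = [d for d in sorted(os.listdir(data_dir))
                   if os.path.isdir(os.path.join(data_dir, d))]

    images, labels = [], []
    for label, class_name in enumerate(class_names):
        class_dir = os.path.join(data_dir, class_name)
        for fname in os.listdir(class_dir):
            img = load_img(os.path.join(class_dir, fname), target_size=(150, 150))
            images.append(img_to_array(img) / 255.0)
            labels.append(label)

    x = np.array(images)
    y = np.array(labels)

    # fit()'s validation_split holds out the *last* fraction of the arrays,
    # so shuffle first to make the split random on every run.
    idx = np.random.permutation(len(x))
    model.fit(x[idx], y[idx], batch_size=32, epochs=50, validation_split=0.2)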

Marcin Możejko
1

You will need to either manually copy out some of your training data and paste it into a validation directory, or create a program to randomly move data from your training directory to your validation directory. With either of these options, you will need to pass in the validation directory to your validation ImageDataGenerator().flow_from_directory() as the path.

Details for organizing your data in the directory structure are covered in this video.

blackHoleDetector
  • Thanks for your answer. But I did not see validation_split as a parameter in fit_generator, and fit_generator is what I want to use. It's a parameter in the fit function. – Sharanya Arcot Desai Oct 13 '17 at 19:33
  • Ah, you're right. I was thinking it was a parameter in both fit() and fit_generator(), but it is only for fit(). I've updated my answer. You will have to either manually or programmatically create your directory structure for both valid and train sets, and then point to these separate directories with your ImageDataGenerators for each of these sets. – blackHoleDetector Oct 14 '17 at 16:21
0

Here's my approach:

import os
import random
import shutil
from tempfile import TemporaryDirectory

# Create a temporary validation set. train_image_folder, train_label_folder
# and train_val_split (the fraction to hold out) are assumed to be defined.
with TemporaryDirectory(dir=train_image_folder) as valid_image_folder, TemporaryDirectory(dir=train_label_folder) as valid_label_folder:
    train_images = os.listdir(train_image_folder)
    train_labels = os.listdir(train_label_folder)

    for img_name in train_images:
        single_name, ext = os.path.splitext(img_name)
        label_name = single_name + '.png'
        if label_name not in train_labels:
            continue
        if random.uniform(0, 1) <= train_val_split:
            # Move the image and its matching label into the temporary validation folders.
            shutil.move(os.path.join(train_image_folder, img_name), os.path.join(valid_image_folder, img_name))
            shutil.move(os.path.join(train_label_folder, label_name), os.path.join(valid_label_folder, label_name))

Don't forget to move everything back before the with block exits, since the temporary directories (and anything left in them) are deleted at that point.
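
A minimal sketch of that restore step, which has to run at the end of the with block while valid_image_folder and valid_label_folder still exist:

    # Move the held-out files back into the training folders before the
    # TemporaryDirectory context manager removes them.
    for fname in os.listdir(valid_image_folder):
        shutil.move(os.path.join(valid_image_folder, fname),
                    os.path.join(train_image_folder, fname))
    for fname in os.listdir(valid_label_folder):
        shutil.move(os.path.join(valid_label_folder, fname),
                    os.path.join(train_label_folder, fname))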

Richard
0

Your solution worked, thanks.

    import os
    import shutil
    import numpy as np

    sourceN = os.path.join(base_dir, "train", "NORMAL")
    destN = os.path.join(base_dir, "val", "NORMAL")
    sourceP = os.path.join(base_dir, "train", "PNEUMONIA")
    destP = os.path.join(base_dir, "val", "PNEUMONIA")

    filesN = os.listdir(sourceN)
    filesP = os.listdir(sourceP)

    # Move roughly 20% of each class, chosen at random, into the validation directory.
    for f in filesN:
        if np.random.rand() < 0.2:
            shutil.move(os.path.join(sourceN, f), os.path.join(destN, f))

    for f in filesP:
        if np.random.rand() < 0.2:
            shutil.move(os.path.join(sourceP, f), os.path.join(destP, f))

    # Report how many files ended up in each directory.
    print(len(os.listdir(sourceN)))
    print(len(os.listdir(sourceP)))
    print(len(os.listdir(destN)))
    print(len(os.listdir(destP)))
Jordy