0

Suppose I have a directory structure like this:

full_dataset
|---horse <= 40 images of horse
|---donkey <= 50 images of donkey
|---cow <= 80 images of cow
|---zebra <= 30 images of zebra

Then I load it with TensorFlow:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

image_generator = ImageDataGenerator(rescale=1./255)
my_dataset = image_generator.flow_from_directory(batch_size=32,
                                                 directory='full_dataset',
                                                 shuffle=True,
                                                 target_size=(280, 280),
                                                 class_mode='categorical')

But I want to split this dataset into training and test sets automatically, without manually moving the files into separate train and test folders. I don't want to split it by hand as in https://www.tensorflow.org/tutorials/images/classification

What I have tried (it failed):

(x_train, y_train),(x_test, y_test) = my_dataset.load_data()
Ichsan

2 Answers

1

You don't have to use TensorFlow or Keras to split your dataset. If you have the scikit-learn package installed, you can simply use it:

from sklearn.model_selection import train_test_split
X = ...
Y = ...
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

You can also use numpy for the same purpose:

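import numpy
X = ...
Y = ...
test_size = 0.2
# Slice bounds must be integers, so cast the sample count to int
train_nsamples = int((1 - test_size) * len(Y))
# Takes the first 80% as training data and the rest as test data, in the existing order
x_train, x_test = X[:train_nsamples], X[train_nsamples:]
y_train, y_test = Y[:train_nsamples], Y[train_nsamples:]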
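
The slicing approach takes the samples in whatever order they already have, so if the data is grouped by class the test set can end up holding only the last classes. A minimal sketch of shuffling the indices first, using only the standard library (shuffled_split_indices is a hypothetical helper name, not part of NumPy or scikit-learn):

```python
import random

def shuffled_split_indices(n_samples, test_size=0.2, seed=None):
    """Return (train_idx, test_idx): shuffled, non-overlapping index lists."""
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the shuffle reproducible for a given seed
    random.Random(seed).shuffle(indices)
    n_test = int(round(test_size * n_samples))
    n_train = n_samples - n_test
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = shuffled_split_indices(10, test_size=0.2, seed=0)
# With NumPy arrays the index lists can be used directly:
# x_train, x_test = X[train_idx], X[test_idx]
# y_train, y_test = Y[train_idx], Y[test_idx]
```

Using the same index lists for X and Y keeps each sample paired with its label.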

With a built-in Keras dataset such as MNIST:

from keras.datasets import mnist
import numpy as np
from sklearn.model_selection import train_test_split

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x = np.concatenate((x_train, x_test))
y = np.concatenate((y_train, y_test))

train_size = 0.7
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=train_size)
bsquare
  • Hello, thanks for the answer, but how do I get X and Y when my input is the images in the "full_dataset" folder? With a pandas DataFrame it would be easier, but these are images. – Ichsan Mar 23 '20 at 17:20
  • 1
    @Ichsan, I updated the answer to use the MNIST data as the X and Y input. – bsquare Mar 24 '20 at 05:19
  • How can I use the "load_data()" function on my "full_dataset" folder? It works with the MNIST dataset, but I want to use it with my own "full_dataset" folder, as shown in the question. – Ichsan Mar 24 '20 at 05:36
0

After a day of trial and error, I found a solution.

FIRST WAY

import glob
import numpy as np
import tensorflow as tf

# Class folders in label order: horse=0, donkey=1, cow=2, zebra=3
class_names = ['horse', 'donkey', 'cow', 'zebra']

data = []
labels = []

for label, name in enumerate(class_names):
    for path in glob.glob('full_dataset/' + name + '/*.*'):
        # load_img expects a lowercase color_mode: 'grayscale', 'rgb' or 'rgba'
        image = tf.keras.preprocessing.image.load_img(path, color_mode='rgb',
                                                      target_size=(280, 280))
        data.append(np.array(image))
        labels.append(label)

data = np.array(data)
labels = np.array(labels)

from sklearn.model_selection import train_test_split
X_train, X_test, ytrain, ytest = train_test_split(data, labels, test_size=0.2,
                                                random_state=42)
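
With classes this small (30 to 80 images), a purely random split can leave a class under-represented in the test set; train_test_split accepts a stratify argument (for example stratify=labels) to keep the class proportions. As a rough sketch of the same idea applied to file paths before any image is loaded, using only the standard library (split_paths_per_class and the fake file names are hypothetical, for illustration only):

```python
import random

def split_paths_per_class(paths_by_class, test_size=0.2, seed=None):
    """Split each class's paths separately; return (train, test) lists of (path, label)."""
    rng = random.Random(seed)
    train, test = [], []
    for label, paths in sorted(paths_by_class.items()):
        paths = sorted(paths)      # fixed starting order, so the seed is reproducible
        rng.shuffle(paths)
        n_test = int(round(test_size * len(paths)))
        test += [(p, label) for p in paths[:n_test]]
        train += [(p, label) for p in paths[n_test:]]
    return train, test

# Fake file names standing in for the results of glob.glob('full_dataset/horse/*.*') etc.
fake_paths = {
    0: ['horse_%d.jpg' % i for i in range(40)],
    3: ['zebra_%d.jpg' % i for i in range(30)],
}
train_pairs, test_pairs = split_paths_per_class(fake_paths, test_size=0.2, seed=42)
```

Each (path, label) pair can then be loaded with load_img as in the loop above, so only the files actually needed are read into memory.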

SECOND WAY

from tensorflow.keras.preprocessing.image import ImageDataGenerator

image_generator = ImageDataGenerator(rescale=1/255, validation_split=0.2)

train_dataset = image_generator.flow_from_directory(batch_size=32,
                                                 directory='full_dataset',
                                                 shuffle=True,
                                                 target_size=(280, 280), 
                                                 subset="training",
                                                 class_mode='categorical')

validation_dataset = image_generator.flow_from_directory(batch_size=32,
                                                 directory='full_dataset',
                                                 shuffle=True,
                                                 target_size=(280, 280), 
                                                 subset="validation",
                                                 class_mode='categorical')

The main drawback of the second way is that you can't easily display a single picture. Indexing the generator with validation_dataset[1] does not give you one image (it failed for me), whereas with the first way X_test[1] works.
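
A minimal sketch of one possible workaround, assuming that indexing the generator yields a whole batch as an (images, labels) pair rather than a single image. That is an assumption about the generator's behaviour, not something shown in this post, and get_sample is a hypothetical helper:

```python
def get_sample(batches, index, batch_size):
    """Map a flat sample index to (batch number, position in batch) and return (image, label)."""
    batch_idx, offset = divmod(index, batch_size)
    images, labels = batches[batch_idx]
    return images[offset], labels[offset]

# Stand-in for a batched generator: two batches of size 2, the last one partial
fake_batches = [(['img0', 'img1'], [0, 1]), (['img2'], [2])]
image, label = get_sample(fake_batches, 2, batch_size=2)
```

With a real generator the call would look like get_sample(validation_dataset, 1, batch_size=32), and the returned image could then be passed to a plotting function. Because shuffle=True, the sample found at a given index may change between epochs.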

Ichsan