How to use K-Fold Cross Validation For This CNN?

Question

I have tried to implement K Fold Cross Validation for my binary image classifier, but I have been struggling for a while as I have been stuck with the whole data processing side of things. I have included my code below (it is quite long and messy - apologies) before my attempts at the K Fold as it went horribly wrong. Any suggestions or support would be greatly appreciated. I believe that using a K Fold is the right approach here, but if not, please let me know. Thank you so much!

I was wondering how I can reformat my data to create the separate folds as pretty much every tutorial out there uses a .csv file; however, I simply have two different folders containing images, either ordered into two separate categories (for the training data) or just one singular category (for the test data).

from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dense
from keras.layers import Dropout
from keras.regularizers import l2
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import EarlyStopping
import numpy as np
import matplotlib.pyplot as plt

classifier = Sequential()
classifier.add(Conv2D(32, (3 , 3), input_shape = (256, 256, 3), activation = 'relu', kernel_regularizer=l2(0.01)))
classifier.add(MaxPooling2D(pool_size=(2,2)))
classifier.add(Flatten())
classifier.add(Dense(units = 128, activation='relu'))
classifier.add(Dropout(0.5))
classifier.add(Dense(units=1, activation='sigmoid'))
classifier.compile(optimizer='adam', loss='binary_crossentropy',metrics=['accuracy'])

train_datagen = ImageDataGenerator(rescale=1./255, shear_range=0.2, zoom_range=0.2, horizontal_flip=True, validation_split=0.2)
test_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(
    'train',
    target_size=(256, 256),
    batch_size=32,
    class_mode='binary',
    subset='training') # set as training data

validation_generator = train_datagen.flow_from_directory(
    'train', # same directory as training data
    target_size=(256, 256),
    batch_size=32,
    class_mode='binary',
    subset='validation')
test_set = test_datagen.flow_from_directory('test', target_size = (256,256), batch_size=10, class_mode='binary')

history = classifier.fit_generator(train_generator, steps_per_epoch=40, epochs=100, validation_data=validation_generator)
classifier.save('50epochmodel')

test_images = np.array(list(next(test_set)[:1]))[0]
probabilities = classifier.predict(test_images)

Nicolas Gervais · Accepted Answer · 2020-10-14T21:10:26.263

2

For more flexibility, you can load the files with a simple function of your own rather than a Keras generator. You can then split the list of file paths into folds, train on all but one fold, and test against the remaining one.

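import os
from collections import deque

import numpy as np
import tensorflow as tf
from glob2 import glob
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

os.chdir(r'catsanddogs')

# Paths look like <split>/<class>/<image>.jpg, relative to the working directory
files = glob(os.path.join('*', '*', '*.jpg'))
# Drop the remainder so the dataset is divisible by 3.
# (files[:-(len(files) % 3)] would give an empty list when the remainder is 0.)
files = files[:len(files) - len(files) % 3]

# Shuffle the indices and split them into 3 folds of equal size
indices = np.random.permutation(len(files)).reshape(3, -1)

imsize = 64


def load(file_path):
    img = tf.io.read_file(file_path)
    img = tf.image.decode_jpeg(img, channels=3)  # the files are .jpg
    img = tf.image.convert_image_dtype(img, tf.float32)
    img = tf.image.resize(img, size=(imsize, imsize))
    # The second path component is the class folder name
    label = tf.strings.split(file_path, os.sep)[1]
    label = tf.cast(tf.equal(label, 'dogs'), tf.int32)
    return img, label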

accuracies_on_test_set = {}

for i in range(len(indices)):
    # Rotate the three folds so that each one is the test fold exactly once
    d = deque(np.array(files)[indices].tolist())
    d.rotate(-i)
    train1, train2, test1 = d
    train_ds = (tf.data.Dataset.from_tensor_slices(train1 + train2)
                .shuffle(len(train1) + len(train2)).map(load).batch(4))
    test_ds = (tf.data.Dataset.from_tensor_slices(test1)
               .shuffle(len(test1)).map(load).batch(4))

    # Build a fresh model for every fold so that no weights carry over
    classifier = Sequential()
    classifier.add(Conv2D(8, (3, 3), input_shape=(imsize, imsize, 3), activation='relu'))
    classifier.add(MaxPooling2D(pool_size=(2, 2)))
    classifier.add(Flatten())
    classifier.add(Dense(units=32, activation='relu'))
    classifier.add(Dropout(0.5))
    classifier.add(Dense(units=1, activation='sigmoid'))
    classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    classifier.fit(train_ds, validation_data=test_ds, epochs=2, verbose=0)
    loss, accuracy = classifier.evaluate(test_ds, verbose=0)
    # One entry per fold (not per epoch)
    accuracies_on_test_set[f'fold_{i + 1}_accuracy'] = accuracy

print(accuracies_on_test_set)

{'fold_1_accuracy': 0.8235, 'fold_2_accuracy': 0.7765, 'fold_3_accuracy': 0.736}
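
Cross-validation results are usually reported as the mean and spread of the per-fold scores rather than as individual numbers. A minimal sketch using the three accuracies printed above and only the standard library (the variable names are illustrative, not part of the original code):

```python
import statistics

# Per-fold test accuracies, copied from the printed result above
fold_accuracies = [0.8235, 0.7765, 0.736]

mean_acc = statistics.mean(fold_accuracies)
# Population standard deviation across the folds
std_acc = statistics.pstdev(fold_accuracies)

print(f"mean accuracy: {mean_acc:.4f} (std {std_acc:.4f}) over {len(fold_accuracies)} folds")
```

With only three folds and two epochs each, the spread is a rough indication at best.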

Here is the rotation of the data sets:

from collections import deque

groups = ['group1', 'group2', 'group3']

for i in range(3):
    d = deque(groups)
    d.rotate(-i)
    print(list(d))

['group1', 'group2', 'group3']
['group2', 'group3', 'group1']
['group3', 'group1', 'group2']

Each group takes a turn in the last position. Whatever is in the last position is unpacked as the test set, and the other two are used for training:

train1, train2, test1 = d
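
The same idea can be sketched for any number of folds, and without trimming the file list, by letting `np.array_split` produce folds of slightly different sizes. This is only an illustration: `k_fold_splits`, its `seed` parameter, and the placeholder file names are hypothetical and not part of the code above.

```python
import numpy as np


def k_fold_splits(items, k, seed=0):
    """Yield (train_items, test_items) pairs, one pair per fold."""
    rng = np.random.default_rng(seed)
    # array_split tolerates lengths that are not a multiple of k
    folds = np.array_split(rng.permutation(len(items)), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield [items[j] for j in train_idx], [items[j] for j in test_idx]


# Placeholder names standing in for the real `files` list
files = [f"img_{n}.jpg" for n in range(10)]

for fold, (train_files, test_files) in enumerate(k_fold_splits(files, k=3), start=1):
    print(fold, len(train_files), len(test_files))
```

Each pair of lists could then be passed to `tf.data.Dataset.from_tensor_slices` in the same way as `train1 + train2` and `test1`.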

edited Oct 14 '20 at 21:10

answered Sep 21 '20 at 16:31

Excuse me if I am being daft, but has this applied the K fold or is this just to format the data more flexibly? – Joel Sep 21 '20 at 17:27
The first 3 lines of the for loop use a rolling list: at each iteration of the loop, one of the three sets is selected and used as the test set. All three parts take turns being one of these: `train1, train2, test1 = d`. I'll modify the answer to show the behavior of the rotation. – Nicolas Gervais Sep 21 '20 at 17:33
Let me know if you would like me to clarify anything else – Nicolas Gervais Sep 21 '20 at 17:37
Thank you very much for clarifying. I am running into this issue: cannot reshape array of size 1061 into shape (3,newaxis), when implementing the code. Have I simply loaded the files incorrectly? Thanks a lot in advance – Joel Sep 21 '20 at 17:44
That's just how I divided my dataset in 3 smaller datasets for simplicity. I changed it slightly to address this but the datasets don't really need to be equal length. – Nicolas Gervais Sep 21 '20 at 18:00
I see. Do I need to organise my images in a different way? Currently, I have two folders, called train and test. Within train I have a folder called cats and a folder called dogs. In my test folder, I simply have a random combination of photos of cats and dogs. Do I need to create a csv file or something like that? – Joel Sep 21 '20 at 18:13
As long as the variable `files` contains all the training pictures, it's fine. – Nicolas Gervais Sep 21 '20 at 18:27
Apologies for this. I can't migrate it to a chat as I do not have enough reputation. I have almost finished with the code but I am running into this error: "too many values to unpack (expected 3)". I have 1050 pictures, so I assume that indices should have a length of 3 rather than 1050 (as it does now when I print len(indices)). – Joel Sep 21 '20 at 19:04
That's exactly right. Indices is 3 lists of indices. – Nicolas Gervais Sep 21 '20 at 19:14
How do I fix that? My code is identical to yours. – Joel Sep 21 '20 at 20:15
You should read the documentation of `glob2.glob`; it will help you get the filenames, and that's all you need – Nicolas Gervais Sep 21 '20 at 20:18

score -1 · Answer 2 · edited Sep 21 '20 at 17:00

There is no magic here yet: you either have to write a wrapper around the generator or use this workaround.

In summary, I suggest you create a CSV file with the image names in the first column and the labels in the second column.
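
One way to build such a CSV from a folder layout like the one in the question (one subfolder per class) is sketched below. The helper names `collect_labeled_files` and `write_label_csv` are hypothetical, and the demo runs on a throwaway directory tree rather than real images; with real data you would point it at the `train` folder instead.

```python
import csv
import tempfile
from pathlib import Path


def collect_labeled_files(train_dir, pattern="*.jpg"):
    """Return (relative_filename, label) pairs, using each subfolder name as the label."""
    root = Path(train_dir)
    rows = []
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for img in sorted(class_dir.glob(pattern)):
            rows.append((img.relative_to(root).as_posix(), class_dir.name))
    return rows


def write_label_csv(rows, csv_path):
    """Write the pairs to a CSV with 'filename' and 'label' columns."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "label"])
        writer.writerows(rows)


# Demo on a temporary tree that mimics train/cats and train/dogs
with tempfile.TemporaryDirectory() as tmp:
    for cls, names in {"cats": ["c1.jpg", "c2.jpg"], "dogs": ["d1.jpg"]}.items():
        (Path(tmp) / cls).mkdir()
        for name in names:
            (Path(tmp) / cls / name).touch()
    rows = collect_labeled_files(tmp)
    write_label_csv(rows, Path(tmp) / "training_labels.csv")
    csv_text = (Path(tmp) / "training_labels.csv").read_text()

print(csv_text)
```

The filenames are written relative to the training folder, so that folder would be the one passed as `directory` when reading the images back.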

After that:

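import pandas as pd
from sklearn.model_selection import KFold
from keras.preprocessing.image import ImageDataGenerator

# Assumed values, not given in the original snippet; adjust to your data
image_dir = 'train'                    # folder the filenames are relative to
kf = KFold(n_splits=5, shuffle=True)   # 5 folds, chosen for illustration
idg = ImageDataGenerator(rescale=1./255)

train_data = pd.read_csv('training_labels.csv')
for train_index, val_index in kf.split(train_data):
    training_data = train_data.iloc[train_index]
    validation_data = train_data.iloc[val_index]
    # class_mode "binary" matches the single sigmoid output of the question's model
    train_data_generator = idg.flow_from_dataframe(training_data, directory=image_dir,
                               x_col="filename", y_col="label",
                               class_mode="binary", shuffle=True)
    valid_data_generator = idg.flow_from_dataframe(validation_data, directory=image_dir,
                               x_col="filename", y_col="label",
                               class_mode="binary", shuffle=True)
    # Build a fresh model here, fit it on train_data_generator,
    # and validate on valid_data_generator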
