Keras fit_generator() doesn't train properly

Question

I am trying to create an image classifier using Keras with a TensorFlow 2.0.0 backend.

I'm training this model on my local machine on a custom dataset of roughly 17 thousand images. The images vary in size and are located in three folders (training, validation, and test), each containing two subfolders (one per class). I tried an architecture similar to VGG16, which yielded more than decent results on this dataset in the past. Note that there is a minor class imbalance in the data (52:48).

When I call fit_generator(), the model doesn't train well: the training loss decreases slightly during the first epoch but barely changes afterward. In the past, using this architecture with stronger regularization, I achieved 85% accuracy after about 55 epochs.

Imports and hyperparameters

import tensorflow as tf
from tensorflow import keras
from keras import backend as k
from keras.layers import Dense, Dropout, Conv2D, MaxPooling2D, Flatten, Input, UpSampling2D
from keras.models import Sequential, Model, load_model
from keras.utils import to_categorical
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ModelCheckpoint

TRAIN_PATH = 'data/train/'
VALID_PATH = 'data/validation/'
TEST_PATH = 'data/test/'
TARGET_SIZE = (256, 256)
RESCALE = 1.0 / 255
COLOR_MODE = 'grayscale'
EPOCHS = 2
BATCH_SIZE = 16
CLASSES = ['Damselflies', 'Dragonflies']
CLASS_MODE = 'categorical'
CHECKPOINT = "checkpoints/weights.hdf5"

Model

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu',
                 input_shape=(256, 256, 1), padding='same'))

model.add(Conv2D(32, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.1))

model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.1))

model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.1))

model.add(Flatten())
model.add(Dense(516, activation='relu'))
model.add(Dropout(0.1))

model.add(Dense(128, activation='relu'))
model.add(Dropout(0.1))

model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='Adam', metrics=['accuracy'])

In the past, I created a custom pipeline to reshape, grayscale, flip, and normalize the images; then, I trained the model using my CPU on batches of processed images.

I tried repeating the process using ImageDataGenerator, flow_from_directory, and GPU support.

# randomly flip images, and scale pixel values
trainGenerator = ImageDataGenerator(rescale=RESCALE, 
                                    horizontal_flip=True,  
                                    vertical_flip=True)

# only rescale the pixel values of validation images
validatioinGenerator = ImageDataGenerator(rescale=RESCALE)

# only rescale the pixel values of test images
testGenerator = ImageDataGenerator(rescale=RESCALE)

# instantiate train flow
trainFlow = trainGenerator.flow_from_directory(
    TRAIN_PATH,
    target_size = TARGET_SIZE,
    batch_size = BATCH_SIZE,
    classes = CLASSES,
    color_mode = COLOR_MODE,
    class_mode = CLASS_MODE,
    shuffle=True
) 

# instantiate validation flow
validationFlow = validatioinGenerator.flow_from_directory(
    VALID_PATH,
    target_size = TARGET_SIZE,
    batch_size = BATCH_SIZE,
    classes = CLASSES,
    color_mode = COLOR_MODE,
    class_mode= CLASS_MODE,
    shuffle=True
)

Then I fit the model using fit_generator.

checkpoints = ModelCheckpoint(CHECKPOINT, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')

with tf.device('/GPU:0'):
    model.fit_generator(
        trainFlow,
        validation_data=validationFlow, 
        callbacks=[checkpoints],
        epochs=EPOCHS
    )

I tried training it for 40 epochs. The classifier reaches 52% accuracy after the first epoch and does not improve after that.
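
For context, 52% is the score a model would get on a 52:48 split by always predicting the majority class. The sketch below computes that baseline accuracy together with the cross-entropy of a model that always outputs the class prior. It assumes the split is exactly 48:52 (the post only gives rounded figures), and prior_baseline is a hypothetical helper name, not a Keras function.

```python
import math

def prior_baseline(class_fractions):
    """Accuracy and cross-entropy of a model that always outputs the class prior."""
    # Always guessing the largest class is right exactly that fraction of the time.
    accuracy = max(class_fractions)
    # Expected categorical cross-entropy when the predicted probabilities
    # equal the class fractions (this is the entropy of the label distribution).
    loss = -sum(p * math.log(p) for p in class_fractions if p > 0)
    return accuracy, loss

# Assumed class fractions, taken from the 52:48 imbalance mentioned in the question.
acc, loss = prior_baseline([0.48, 0.52])
print(f"baseline accuracy={acc:.2f}, baseline loss={loss:.4f}")
```

A training loss stuck near this value (about 0.692, just under ln 2) together with an accuracy stuck at the majority share suggests the network is reproducing the class prior and not learning from the images.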

Testing the classifier

testFlow = testGenerator.flow_from_directory(
    TEST_PATH,
    target_size = TARGET_SIZE,
    batch_size = BATCH_SIZE,
    classes = CLASSES,
    color_mode = COLOR_MODE,
    class_mode= CLASS_MODE,
)

ans = model.predict_generator(testFlow)

When I look at the predictions, the model predicts all the test images as the majority class with the same confidence [0.48498476, 0.51501524].
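
One quick way to confirm this kind of collapse is to check whether every row of the prediction array is effectively the same. This is a minimal sketch in plain Python; predictions_collapsed and the tol threshold are illustrative names chosen here, not part of Keras.

```python
def predictions_collapsed(predictions, tol=1e-3):
    """Return True if every row of class probabilities is (nearly) identical."""
    rows = [list(row) for row in predictions]
    if not rows:
        return False
    n_classes = len(rows[0])
    for j in range(n_classes):
        # Spread of the predicted probability for class j across all samples.
        column = [row[j] for row in rows]
        if max(column) - min(column) > tol:
            return False
    return True

# Toy input that mimics the output described above.
constant_preds = [[0.48498476, 0.51501524]] * 4
print(predictions_collapsed(constant_preds))
```

The function only iterates over rows, so it should also accept the array returned by predict_generator above, for example predictions_collapsed(ans).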

Have I made sure the data is correct?

Yes. I tested whether the generators yield processed images and their corresponding labels correctly.

Have I tried changing the loss function, activation function, and optimizer?

Yes. I tried changing the class mode to binary, the loss to binary_crossentropy, and changing the last layer to produce a single output with sigmoid activation. No, I did not change the optimizer. However, I did try to increase the learning rate.

Have I tried changing the model's architecture?

Yes. I tried increasing and decreasing model complexity. Both more layers with less regularization and fewer layers with more regularization produced similar results.

Are the layers trainable?

Yes.

Is the GPU support implemented correctly?

I hope so.

print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

Num GPUs Available: 1

a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a') 
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b') 
c = tf.matmul(a, b)

config = tf.compat.v1.ConfigProto(log_device_placement=True) 
config.gpu_options.allow_growth = True 
sess = tf.compat.v1.Session(config=config)
print(sess)

Device mapping: /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: NVIDIA GeForce GTX 1050 with Max-Q Design, pci bus id: 0000:03:00.0, compute capability: 6.1

<tensorflow.python.client.session.Session object at 0x000001F9443E2CC0>

Have I tried transfer learning?

Not yet.

I found a similar unanswered question from 2017: keras-doesnt-train-using-fit-generator.

Thoughts?

Hey, since the validation is not used to update the weights, only to inform us how the model is doing, does it matter? Does it affect the model's training process? @M.Innat — Gal Gilor, May 07 '21 at 20:56
I could not find fit_generator() for tf.keras 2.11 on the official page: https://www.tensorflow.org/api_docs/python/tf/keras/Model . There is only fit() and train_on_batch(). — HD2000, Jan 23 '23 at 11:27

score 1 · Answer 1 · answered May 07 '21 at 17:00

The problem is with your model. I copied your code and ran it on a dataset I have used before (one that gets high accuracy) and got results similar to yours. I then substituted the simple model below:

model = tf.keras.Sequential([
    Conv2D(16, 3, padding='same', activation='relu', input_shape=(256 , 256,1)),
    MaxPooling2D(),
    Conv2D(32, 3, padding='same', activation='relu' ),
    MaxPooling2D(),
    Conv2D(64, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(128, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(256, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(.3),
    Dense(64, activation='relu'),
    Dropout(.3),
    Dense(2, activation='softmax')
])
model.compile(loss='categorical_crossentropy',
              optimizer='Adam', metrics=['accuracy'])

The model trained properly. By the way, model.fit_generator is deprecated; model.fit now accepts generators directly. I then took your model, removed all the dropout layers except the last one, and it also trained properly. The code is:

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu',
                 input_shape=(256, 256, 1), padding='same'))

model.add(Conv2D(32, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
#model.add(Dropout(0.1))

model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
#model.add(Dropout(0.1))

model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
#model.add(Dropout(0.1))

model.add(Flatten())
model.add(Dense(516, activation='relu'))
#model.add(Dropout(0.1))

model.add(Dense(128, activation='relu'))
model.add(Dropout(0.1))

model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='Adam', metrics=['accuracy'])

Thanks @Gerry P I tried applying your suggestion and removing the Dropout layers. Unfortunately, it did not work for me. `Epoch 1/20 loss: 0.6956 - accuracy: 0.5163 - val_loss: 0.6519 - val_accuracy: 0.5235 Epoch 2/20 loss: 0.6924 - accuracy: 0.5207 - val_loss: 0.6509 - val_accuracy: 0.5235 Epoch 3/20 loss: 0.6925 - accuracy: 0.5207 - val_loss: 0.6568 - val_accuracy: 0.5235 Epoch 4/20 loss: 0.6924 - accuracy: 0.5207 - val_loss: 0.6589 - val_accuracy: 0.5235` Also same predictions ```[0.4853577, 0.5146423]``` — Gal Gilor, May 07 '21 at 20:40
Strange, it worked fine for me. Did you try the model I provided? I copied all your code except the line with tf.device('/GPU:0'):, so the only thing I can think of is the nature of your data. If you are on Kaggle, try downloading this dataset at https://www.kaggle.com/gpiosenka/beauty-detection-data-set and see if the model trains correctly on it. — Gerry P, May 07 '21 at 21:16
Thank you! I found the problem. Removing the import `from keras import backend as k` resolved the issue. — Gal Gilor, May 08 '21 at 00:02

Gal Gilor · Answer 2 · 2021-05-10T15:00:03.347

@Gerry P,

By accident, I found what's causing the error. Removing the import `from keras import backend as k` resolved the model's inability to learn.
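
One plausible reading, which the thread does not confirm, is that the script mixes the standalone keras package with the tensorflow.keras package bundled in TensorFlow. The sketch below scans a script's import statements and reports which of the two it pulls in. It assumes the script is available as source text; keras_import_roots is a hypothetical helper, and it only inspects imports, so attribute access such as tf.keras.Sequential is not detected.

```python
import ast

def keras_import_roots(source):
    """Return which Keras flavours a script imports: 'keras' and/or 'tf.keras'."""
    roots = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Join module and imported name so that
            # "from tensorflow import keras" is seen as "tensorflow.keras".
            names = [f"{node.module}.{alias.name}" for alias in node.names]
        else:
            continue
        for name in names:
            if name == "keras" or name.startswith("keras."):
                roots.add("keras")
            elif name == "tensorflow.keras" or name.startswith("tensorflow.keras."):
                roots.add("tf.keras")
    return roots

# Import lines in the style of the question's script.
script = """
import tensorflow as tf
from tensorflow import keras
from keras import backend as k
from keras.layers import Dense
"""
print(sorted(keras_import_roots(script)))
```

If both flavours show up, importing everything from a single one (for example, only from tensorflow.keras) is a reasonable first thing to try.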

That's not all. I also found that three other things affected the fitting process: using the model you defined, not using ModelCheckpoint, and not passing custom class names.

model = Sequential([
    Conv2D(16, 3, padding='same', activation='relu', input_shape=(256 , 256, 1)),
    MaxPooling2D(),
    Conv2D(32, 3, padding='same', activation='relu' ),
    MaxPooling2D(),
    Conv2D(64, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(128, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(256, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(.3),
    Dense(64, activation='relu'),
    Dropout(.3),
    Dense(2, activation='softmax')
])

I had commented out that import to try to resolve an error that occurred when I copy-pasted your Sequential model, and then forgot to uncomment it when I tested the model on the beautiful-or-average dataset, where I achieved over 80% accuracy after the third epoch. When I reverted the changes and tried it on my dataset, it failed again. As a bonus, not importing Keras's backend decreased the time it takes to train the model!

Recently, I had to reinstall Keras and TensorFlow because they could no longer detect my GPU. I probably made a mistake and installed an incompatible version of Keras:

CUDA==10.0
tensorflow-gpu==2.0.0
keras==2.3.1
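
To see which versions are actually installed in the active environment, the package metadata can be queried directly. This is a sketch using the standard library's importlib.metadata, which requires Python 3.8 or newer (on older interpreters, pip freeze gives the same information). installed_versions is a hypothetical helper, and the package names passed to it are the pip names assumed from the list above.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions

# Both TensorFlow pip names are queried because either may be installed.
print(installed_versions(["tensorflow", "tensorflow-gpu", "keras"]))
```

Seeing both a standalone keras entry and a TensorFlow entry means two separate Keras implementations are importable side by side.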

Note that this is still not a 100% solution; the problem comes back every so often.

EDIT:

Whenever it doesn't work, simplify the model. Changed batch size and stopped learning? Simplify the model. Augmented the images further and stopped learning? Simplify the model.
