
My Keras model seems to have hit a saddle point in its training. Of course, this is just an assumption; I'm not really sure. In any case, the loss stops at .0025 and nothing I have tried has reduced it any further.

What I have tried so far:

  1. Using Adam and RMSProp, with and without cyclical learning rates. The result is that the loss starts at .0989 and stays there. The learning rate range for the cyclical schedule was .001 to .1.

  2. After 4 or 5 epochs of no movement I tried SGD instead, and the loss steadily declined to .0025. This is where the loss stalls out. After about 5 epochs of no change I tried SGD with the cyclical learning rate enabled, hoping the loss would decrease, but I get the same result (a sketch of this SGD + cyclical setup follows the list).

  3. I have tried increasing network capacity (as well as decreasing it), thinking maybe the network had hit its learning limit. I increased all 4 dense layers to 4096 units. That didn't change anything.

  4. I've tried different batch sizes.
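
For reference, here is a minimal sketch of the SGD-with-cyclical-learning-rate variant from item 2, reusing the `clr` schedule and `model` defined in the full listing below (the momentum value is an assumption, not something from my actual runs):

import tensorflow as tf

# same cyclical schedule (clr) as in the listing below, passed to SGD instead of Adam
# momentum=0.9 is an assumption; the original runs may have used the default of 0.0
sgd_optimizer = tf.keras.optimizers.SGD(learning_rate=clr, momentum=0.9)
model.compile(optimizer=sgd_optimizer, loss='mean_squared_error')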

The most epochs I have trained the network for is 7. However, for 6 of those epochs neither the loss nor the validation loss changed. Do I need to train for more epochs, or could it be that .0025 is not a saddle point but the global minimum for my dataset? I would think there is more room to improve. I tested the predictions of the network at .0025 and they aren't that great.

Any advice on how to continue? My code is below.

For starters, my Keras model is similar in style to VGG-16:

# imports
!pip install -q -U tensorflow_addons  # shell command; run in a notebook cell
import tensorflow_addons as tfa
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def get_model(input_shape):
    inputs = keras.Input(shape=input_shape)

    # block 1
    x = layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu', padding="same")(inputs)
    x = layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu', padding="same")(x)
    x = layers.MaxPooling2D(pool_size=(2, 2), strides=None, padding="same")(x)

    # block 2
    x = layers.Conv2D(filters=128, kernel_size=(3, 3), activation='relu', padding="same")(x)
    x = layers.Conv2D(filters=128, kernel_size=(3, 3), activation='relu', padding="same")(x)
    x = layers.MaxPooling2D(pool_size=(2, 2), strides=None, padding="same")(x)

    # block 3
    x = layers.Conv2D(filters=256, kernel_size=(3, 3), activation='relu', padding="same")(x)
    x = layers.Conv2D(filters=256, kernel_size=(3, 3), activation='relu', padding="same")(x)
    x = layers.Conv2D(filters=256, kernel_size=(3, 3), activation='relu', padding="same")(x)
    x = layers.Conv2D(filters=256, kernel_size=(3, 3), activation='relu', padding="same")(x)
    x = layers.MaxPooling2D(pool_size=(2, 2), strides=None, padding="same")(x)

    # block 4
    x = layers.Conv2D(filters=512, kernel_size=(3, 3), activation='relu', padding="same")(x)
    x = layers.Conv2D(filters=512, kernel_size=(3, 3), activation='relu', padding="same")(x)
    x = layers.Conv2D(filters=512, kernel_size=(3, 3), activation='relu', padding="same")(x)
    x = layers.Conv2D(filters=512, kernel_size=(3, 3), activation='relu', padding="same")(x)
    x = layers.MaxPooling2D(pool_size=(2, 2), strides=None, padding="same")(x)

    # head: 9 outputs in [0, 1], hence the sigmoid output and MSE loss
    x = layers.Flatten()(x)
    x = layers.Dense(4096, activation='relu')(x)
    x = layers.Dense(2048, activation='relu')(x)
    x = layers.Dense(1024, activation='relu')(x)
    x = layers.Dense(512, activation='relu')(x)

    outputs = layers.Dense(9, activation='sigmoid')(x)
    return keras.models.Model(inputs=inputs, outputs=outputs)

# define learning rate range
lr_range = [.001, .1]
epochs = 100
batch_size = 32
# based on https://www.tensorflow.org/addons/tutorials/optimizers_cyclicallearningrate
steps_per_epoch = len(training_data)/batch_size
clr = tfa.optimizers.CyclicalLearningRate(initial_learning_rate=lr_range[0],
    maximal_learning_rate=lr_range[1],
    scale_fn=lambda x: 1/(2.**(x-1)),
    step_size=2 * steps_per_epoch
)
optimizer = tf.keras.optimizers.Adam(clr)

model = get_model((224, 224, 3))
model.compile(optimizer=optimizer, loss='mean_squared_error')
# the train/validation inputs are tf.data.Dataset objects that are already batched,
# so batch_size is not passed to fit()
model.fit(train_ds, validation_data=valid_ds, epochs=epochs)
  • Why do you think it is a saddle point? Usually Adam or SGD is enough to move away from a saddle point (due to the noisy gradient), so it is probably something else. Is your label a soft label, i.e. a `[1 x 9]` vector that sums to `1`? If it is a discrete one, i.e. a whole number, try switching to `sparse_categorical_crossentropy` instead – Wakeme UpNow Feb 14 '23 at 18:04
  • If the task is single-class classification, try changing `activation='sigmoid'` to `activation='softmax'` (assuming `from_logits=False` in the crossentropy loss) – Wakeme UpNow Feb 14 '23 at 18:10
  • Well, I'm a bit confused because all of the literature I'm reading says Adam usually can escape a saddle point. The labels are floating point numbers between 0 and 1. They represent vertex coordinates, points in 3D space. – junfanbl Feb 14 '23 at 18:14
  • Are you trying to overfit the model? If so, what happens when you reduce the training data size? – Wakeme UpNow Feb 14 '23 at 18:29
  • I plan on adding more data later on, so I'm trying to leave some room for extra capacity if needed, although my hardware does not permit that much more. Removing the relu activation does not appear to change anything. I can try to reduce the data size further. Will let you know. Thank you. – junfanbl Feb 14 '23 at 18:34
  • Reduced the input size to 64 × 64 × 3. No difference in the loss; it still hovers around .0989 after 2 epochs. – junfanbl Feb 14 '23 at 19:07
  • What if you reduce the number of samples? Does it overfit? – Wakeme UpNow Feb 14 '23 at 19:10
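
To make the overfit check from the last comment concrete, a minimal sketch, assuming `train_ds` is the already-batched tf.data.Dataset used above and that 8 batches is an arbitrary subset size:

# take a small subset of the training data and try to drive the loss toward zero;
# if the network cannot overfit even this, the data/labels or loss setup is suspect
small_ds = train_ds.take(8)  # 8 batches; with batch_size=32 that is 256 samples
overfit_model = get_model((224, 224, 3))
# 1e-4 is an assumed fixed learning rate for this check, not from the original runs
overfit_model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss='mean_squared_error')
overfit_model.fit(small_ds, epochs=50)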

0 Answers