What is the problem with this SGD loss graph?

Question

I've been trying to train an audio classification model. When I use SGD with learning_rate=0.01, momentum=0.0 and nesterov=False, I get the following loss and accuracy graphs:

I can't figure out what causes the sudden drop in loss at around epoch 750. I tried different learning rates, momentum values and combinations of them, different batch sizes, initial layer weights, etc. to get a more reasonable graph, but with no luck. If you know what causes this, please let me know.

The code I used for this training is below:

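import tensorflow as tf
from tensorflow.keras.optimizers import SGD

# mfcc_inputs, spec_inputs and chroma_inputs are assumed to be Keras input tensors defined elsewhere (not shown).

# MFCCs Model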
x = tf.keras.layers.Dense(units=512, activation="sigmoid")(mfcc_inputs)
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Dense(units=256, activation="sigmoid")(x)
x = tf.keras.layers.Dropout(0.5)(x)


# Spectrograms Model
y = tf.keras.layers.Conv2D(32, kernel_size=(3,3), strides=(2,2))(spec_inputs)
y = tf.keras.layers.AveragePooling2D(pool_size=(2,2), strides=(2,2))(y)
y = tf.keras.layers.BatchNormalization()(y)
y = tf.keras.layers.Activation("sigmoid")(y)

y = tf.keras.layers.Conv2D(64, kernel_size=(3,3), strides=(1,1), padding="same")(y)
y = tf.keras.layers.AveragePooling2D(pool_size=(2,2), strides=(2,2))(y)
y = tf.keras.layers.BatchNormalization()(y)
y = tf.keras.layers.Activation("sigmoid")(y)

y = tf.keras.layers.Conv2D(64, kernel_size=(3,3), strides=(1,1), padding="same")(y)
y = tf.keras.layers.AveragePooling2D(pool_size=(2,2), strides=(2,2))(y)
y = tf.keras.layers.BatchNormalization()(y)
y = tf.keras.layers.Activation("sigmoid")(y)

y = tf.keras.layers.Flatten()(y)
y = tf.keras.layers.Dense(units=256, activation="sigmoid")(y)
y = tf.keras.layers.Dropout(0.5)(y)


# Chroma Model
t = tf.keras.layers.Dense(units=512, activation="sigmoid")(chroma_inputs)
t = tf.keras.layers.Dropout(0.5)(t)
t = tf.keras.layers.Dense(units=256, activation="sigmoid")(t)
t = tf.keras.layers.Dropout(0.5)(t)


# Merge Models
concated = tf.keras.layers.concatenate([x, y, t])


# Dense and Output Layers
z = tf.keras.layers.Dense(64, activation="sigmoid")(concated)
z = tf.keras.layers.Dropout(0.5)(z)
z = tf.keras.layers.Dense(64, activation="sigmoid")(z)
z = tf.keras.layers.Dropout(0.5)(z)
z = tf.keras.layers.Dense(1, activation="sigmoid")(z)


mdl = tf.keras.Model(inputs=[mfcc_inputs, spec_inputs, chroma_inputs], outputs=z)
# SGD() uses the Keras defaults: learning_rate=0.01, momentum=0.0, nesterov=False
mdl.compile(optimizer=SGD(), loss="binary_crossentropy", metrics=["accuracy"])
mdl.fit(
    [M_train, X_train, C_train],
    y_train,
    batch_size=8,
    epochs=1000,
    validation_data=([M_val, X_val, C_val], y_val),
    callbacks=[tensorboard_cb],
)

I tried relu, softmax, tanh, sigmoid, etc. The sigmoid and SGD combination yielded the best result. - Alp Sarıca, Jul 29 '21 at 21:11

score 0 · Answer 1 · answered Jul 29 '21 at 22:30

I'm not entirely sure myself, but as Frightera said, sigmoid activations in hidden layers can cause trouble because they are more sensitive to weight initialization: if the initial weights push the units into saturation, the gradients can become very small. Perhaps the model eventually works its way out of the small-gradient region and the loss finally drops around epoch 750, but that is just my hypothesis. If ReLU doesn't work, try LeakyReLU, since it doesn't suffer from the dead-neuron effect that ReLU does.
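To make the small-gradient argument concrete, here is a rough sketch in plain Python (no TensorFlow needed). It only multiplies the activation derivatives along one path through a stack of layers and ignores the weights entirely, so it illustrates the effect rather than modelling the network in the question. The depth of 5 and the pre-activation values are made-up numbers, and `chained_factor` is just a helper name I picked for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    # Derivative of the sigmoid; its maximum is 0.25, reached at z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z):
    # ReLU passes no gradient for negative inputs (the "dead neuron" case).
    return 1.0 if z > 0 else 0.0


def leaky_relu_grad(z, alpha=0.01):
    # LeakyReLU keeps a small non-zero slope alpha for negative inputs.
    return 1.0 if z > 0 else alpha


def chained_factor(grad_fn, pre_activations):
    # Product of activation derivatives along one path through the layers.
    # Weights are deliberately ignored (an assumption of this sketch).
    factor = 1.0
    for z in pre_activations:
        factor *= grad_fn(z)
    return factor


depth = 5  # illustrative number of stacked sigmoid layers

# Best case for sigmoid: every unit sits exactly at z = 0.
best_case_sigmoid = chained_factor(sigmoid_grad, [0.0] * depth)

# Saturated case: every unit has a pre-activation of 4.0.
saturated_sigmoid = chained_factor(sigmoid_grad, [4.0] * depth)

print("sigmoid, best case:", best_case_sigmoid)
print("sigmoid, saturated:", saturated_sigmoid)
print("ReLU grad at z=-1:", relu_grad(-1.0))
print("LeakyReLU grad at z=-1:", leaky_relu_grad(-1.0))
```

Even in the best case the sigmoid factor shrinks by at least 4x per layer, and it collapses much faster once units saturate, which would fit a long plateau before the loss moves. In Keras, the LeakyReLU suggestion would mean leaving `activation` unset on a hidden `Dense` layer and following it with a `tf.keras.layers.LeakyReLU()` layer; the output layer would keep its sigmoid for binary cross-entropy.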
