
Adding dropout layers made the validation loss remain lower than the training loss. Is it acceptable to have a constant generalization gap over the training period? [Train and validation loss curves attached as an image.]

Here is the architecture:

import tensorflow as tf

# Three stacked recurrent layers, each followed by batch norm and 40% dropout,
# ending in a 3-class softmax classifier.
model = tf.keras.Sequential([
    tf.keras.layers.CuDNNLSTM(1024, input_shape=(9, 41), return_sequences=True),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.CuDNNLSTM(512, return_sequences=True),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.CuDNNLSTM(256),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Dense(3, activation=tf.nn.softmax)
])
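
(Not part of the original post, but for context: a minimal training sketch that would produce the loss curves in question. x_train, y_train, x_val and y_val are hypothetical placeholder arrays, and categorical cross-entropy assumes one-hot labels.)

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# history.history['loss'] and history.history['val_loss'] hold the
# train/validation curves discussed in the answers below.
history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=50, batch_size=64)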
– Srinath

2 Answers


This is normal when using Dropout layers. The explanation is that, since Dropout adds noise to the training process, the training loss increases a little, while the increased generalization power makes the validation loss decrease a little, creating the inverted effect you see.

And yes, it's normal to have this generalization gap.
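
(An illustration of that asymmetry, not from the original answer; assumes eager execution, i.e. TF 2.x.)

import numpy as np
import tensorflow as tf

drop = tf.keras.layers.Dropout(0.4)
x = np.ones((1, 10), dtype=np.float32)

# training=True: roughly 40% of the units are zeroed and the rest are scaled
# by 1/0.6 -- this is the noise that inflates the training loss.
print(drop(x, training=True))

# training=False (the default at evaluation time): the input passes through
# unchanged, so the validation loss sees the full, un-noised network.
print(drop(x, training=False))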

– Dr. Snoopy
  • Thanks! By the way, how does one decide the dropout percentage? A large dropout percentage would lead to underfitting, so do we just need to strike the right balance? – Srinath Jun 20 '19 at 05:06
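
(Not part of the thread, but one common way to strike that balance is to treat the dropout rate as a hyperparameter and keep whichever value gives the lowest validation loss. A rough sketch; build_model, x_train, y_train, x_val and y_val are hypothetical.)

def build_model(rate):
    # Hypothetical helper: a smaller variant of the architecture above,
    # with a configurable dropout rate.
    model = tf.keras.Sequential([
        tf.keras.layers.CuDNNLSTM(1024, input_shape=(9, 41), return_sequences=True),
        tf.keras.layers.Dropout(rate),
        tf.keras.layers.CuDNNLSTM(256),
        tf.keras.layers.Dropout(rate),
        tf.keras.layers.Dense(3, activation=tf.nn.softmax)
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    return model

best_rate, best_val = None, float('inf')
for rate in [0.2, 0.3, 0.4, 0.5]:
    h = build_model(rate).fit(x_train, y_train,
                              validation_data=(x_val, y_val),
                              epochs=20, verbose=0)
    if min(h.history['val_loss']) < best_val:
        best_rate, best_val = rate, min(h.history['val_loss'])
print('best dropout rate:', best_rate)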

It is generally better to interpret loss curves based on their progress, irrespective of whether the training loss lies above the validation loss or vice versa, or of how large a reasonable gap between them is. It is completely fine to continue training even when the validation loss lies above the training loss, as long as both keep decreasing [i.e., until the validation loss no longer improves].
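
(My addition, not the answerer's: Keras's EarlyStopping callback implements exactly this "stop once the validation loss no longer improves" rule. model, x_train, y_train, x_val and y_val are the hypothetical objects from the sketches above.)

from tensorflow.keras.callbacks import EarlyStopping

# Stop once val_loss has not improved for 5 consecutive epochs and roll back
# to the best weights seen so far.
early_stop = EarlyStopping(monitor='val_loss', patience=5,
                           restore_best_weights=True)

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=200,
          callbacks=[early_stop])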

PS: It is generally better to apply dropout in deeper layers than in shallow layers. The reasoning comes from the Partial Information Decomposition principle: shallow layers contain synergistic information, while deeper layers contain unique and redundant information.
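
(A sketch of that placement advice, not from the original answer: the same stack with dropout applied only after the deeper recurrent layers.)

model = tf.keras.Sequential([
    # Shallow layer: no dropout applied here.
    tf.keras.layers.CuDNNLSTM(1024, input_shape=(9, 41), return_sequences=True),
    tf.keras.layers.BatchNormalization(),
    # Deeper layers: dropout applied here.
    tf.keras.layers.CuDNNLSTM(512, return_sequences=True),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.CuDNNLSTM(256),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Dense(3, activation=tf.nn.softmax)
])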

– Vigneswaran C
  • What do you think is happening when this gap starts narrowing and at one point the curves cross each other? Like in this image: https://ibb.co/w0tf5Yj – Srinath Jun 20 '19 at 05:11
  • Val loss corresponds to the generalization of the model. When the rate of decrease in val loss starts to slow, it means the model is settling into a minimum of the loss landscape. Train loss is an empirical error that we reduce in the hope that it is commensurate with the generalization error, and of course, in reality, they don't follow each other perfectly. When the train loss decreases more than the val loss (as in the image), the empirical error we are reducing no longer reflects the generalization error; however, as long as both still improve, we can continue training and ignore the overstatement of train_loss. – Vigneswaran C Jun 20 '19 at 16:22
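
(For reference, my addition: the kind of curves being discussed can be plotted from the fit history above, assuming the hypothetical history object.)

import matplotlib.pyplot as plt

plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='val loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()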