
I have been working on a project involving a sequence-to-sequence autoencoder for time series forecasting, and I have used tf.contrib.rnn.MultiRNNCell in both the encoder and the decoder. I am confused about which strategy to use to regularize my seq2seq model. Should I apply L2 regularization in the loss, or use tf.contrib.rnn.DropoutWrapper on the cells inside the MultiRNNCell? Or can I use both strategies: L2 for the weights and biases (projection layer) and DropoutWrapper between the cells in the MultiRNNCell? Thanks in advance :)
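For context, a simplified sketch of the dropout setup I'm describing (TF 1.x; the cell type, layer sizes, and number of layers are just placeholders):

```python
import tensorflow as tf

# Keep probability fed at training time (e.g. 0.5); defaults to 1.0 so dropout
# is effectively disabled at inference.
keep_prob = tf.placeholder_with_default(1.0, shape=[])

def make_cell(num_units):
    # Wrap each LSTM cell so dropout is applied to its outputs between stacked layers.
    cell = tf.contrib.rnn.LSTMCell(num_units)
    return tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=keep_prob)

encoder_cell = tf.contrib.rnn.MultiRNNCell([make_cell(128) for _ in range(2)])
decoder_cell = tf.contrib.rnn.MultiRNNCell([make_cell(128) for _ in range(2)])
```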

dnovai

1 Answer


You can use both dropout and L2 regularization at the same time, as is commonly done; they are quite different types of regularization. However, I would note that recent literature suggests batch normalization can replace the need for dropout, as stated in the original paper on batch normalization:

https://arxiv.org/abs/1502.03167

From the abstract: "It also acts as a regularizer, in some cases eliminating the need for Dropout."

L2 regularization is still typically applied when batchnorm is in use. There's nothing stopping you from applying all three forms of regularization; the statement above only indicates that you might not see an additional improvement from dropout when batchnorm is already in use.
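For concreteness, here's a rough sketch of adding an L2 penalty on the projection layer's weights to the training loss, alongside the DropoutWrapper you already have. The variable scope name "projection", the MSE loss, and the stand-in placeholders are my assumptions, not something from your code:

```python
import tensorflow as tf

# Stand-ins for the decoder outputs and targets; in the real model these come
# from the seq2seq graph. Shapes are [batch, time, features].
decoder_outputs = tf.placeholder(tf.float32, [None, None, 128])
targets = tf.placeholder(tf.float32, [None, None, 1])

# Output projection layer; its kernel is what the L2 penalty targets.
with tf.variable_scope("projection"):
    predictions = tf.layers.dense(decoder_outputs, units=1)

# Base forecasting loss (MSE here as an illustration).
mse_loss = tf.reduce_mean(tf.squared_difference(predictions, targets))

# L2 penalty restricted to the projection kernel (biases are usually excluded).
l2_scale = 1e-4  # regularization strength, a hyperparameter (see below)
projection_kernels = [v for v in tf.trainable_variables()
                      if "projection" in v.name and "kernel" in v.name]
l2_loss = tf.add_n([tf.nn.l2_loss(w) for w in projection_kernels])

total_loss = mse_loss + l2_scale * l2_loss
train_op = tf.train.AdamOptimizer(1e-3).minimize(total_loss)
```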

There are generally optimal values for the amount of L2 regularization to apply and for the dropout keep probability. These are hyperparameters that you tune by trial and error or with a hyperparameter search algorithm.
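A minimal random-search sketch over those two hyperparameters; `build_and_evaluate` is a hypothetical helper that would train your model with the given settings and return a validation error:

```python
import random

best_params, best_error = None, float("inf")
for _ in range(20):
    # Sample the L2 scale log-uniformly and the keep probability uniformly.
    l2_scale = 10 ** random.uniform(-6, -2)
    keep_prob = random.uniform(0.5, 1.0)
    val_error = build_and_evaluate(l2_scale=l2_scale, keep_prob=keep_prob)  # hypothetical
    if val_error < best_error:
        best_params, best_error = (l2_scale, keep_prob), val_error

print("best (l2_scale, keep_prob):", best_params, "val error:", best_error)
```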

David Parks
  • Thanks, interesting paper. All the best! I am going to follow your advice. I think that L2 needs an extra parameter (to be selected), while the dropout strategy generally just uses keep_prob = 0.5. I will run some benchmark experiments to work out the details. – dnovai Apr 24 '18 at 14:56
  • When I've run hyperparameter searches, I've found the optimal dropout rate to be quite different from 0.5. In my experience, it depended on how much data I had. With large datasets, 0.98 was once chosen as the optimal keep probability (i.e., dropout was of little benefit). – David Parks Apr 24 '18 at 15:05