21

I am using a CNN for a regression task in TensorFlow, with Adam as the optimizer. The network seems to converge perfectly fine until a point where the loss suddenly increases along with the validation error. Here are the loss plots for the labels and the weights, shown separately (the optimizer is run on the sum of them): [plots: label loss, weight loss]

I use an L2 loss for weight regularization and also for the labels. I apply some randomness to the training data. I am currently trying RMSProp to see if the behavior changes, but it takes at least 8 hours to reproduce the error.

I would like to understand how this can happen. Hope you can help me.
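For reference, here is a minimal sketch of the kind of setup described above (the network, shapes, and hyperparameter values are illustrative placeholders, not the actual code): an L2 label loss plus L2 weight regularization, with Adam run on their sum.

```python
import tensorflow as tf

# Hypothetical inputs/targets -- the real CNN and data are not shown in the question.
inputs = tf.placeholder(tf.float32, [None, 64])
labels = tf.placeholder(tf.float32, [None, 1])

# Stand-in for the CNN: a single dense layer, just to make the snippet runnable.
predictions = tf.layers.dense(inputs, 1)

# L2 loss on the labels (data term).
label_loss = tf.reduce_mean(tf.square(predictions - labels))

# L2 weight regularization summed over all trainable variables.
weight_decay = 1e-4  # illustrative value
weight_loss = weight_decay * tf.add_n(
    [tf.nn.l2_loss(v) for v in tf.trainable_variables()])

# The optimizer is run on the sum of both terms, as described above.
total_loss = label_loss + weight_loss
train_op = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(total_loss)
```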

andre_bauer
  • Reduce learning rate? – Jean-François Corbett Feb 14 '17 at 08:54
  • Well, normally with Adam you shouldn't need to reduce the learning rate while training. A learning rate that is too high should cause the network to converge at a worse loss value, right? After the RMSProp run I can try a lower initial rate, but that will mean it takes even more time for this to happen, I think... – andre_bauer Feb 14 '17 at 09:26
  • Wait, what is the first plot showing? It's the training loss, right? But it's going down? Where is the problem then? Can you explain? If you are speaking of the combined loss, which then gets dominated by the weight regularization (that's how I interpret it), maybe play with an alpha that sets the scale of those two loss components. – sascha Feb 14 '17 at 14:10
  • Yeah, the first plot is the training loss without the weight loss, the second is the weight loss only. Optimization is done on the sum of both! It goes down until 160k iterations for the blue line and 325k for the orange one, and the yellow one was about to go up when I cancelled! Since the scale is a log scale, the blue and orange losses double on average after the mentioned iterations, which shouldn't be normal, right? – andre_bauer Feb 14 '17 at 19:17
  • 1
    Is the sum of the losses still decreasing? – drpng Feb 14 '17 at 22:03
  • No, weight and data loss both increase, and so does their sum :/ Just tried RMSProp and lowering the learning rate after 300k iterations, and it's still increasing – andre_bauer Feb 15 '17 at 07:08
  • I had that kind of trouble when I forgot to set `shuffle=True` on the `tf.train.string_input_producer()`. Basically the network was seeing _nice_ examples all the time and then, after 800k iterations, reality struck (see the sketch after these comments). – sunside Feb 17 '17 at 00:20
  • I am experiencing the same thing, using tflearn and the Adam optimizer. I have plenty of data. My network converges nicely on both training and validation data up until a point where both losses start to grow. It also takes about 8 hours of training on a Tesla K80 until it starts growing. My data is always shuffled from the very beginning of training. Did anybody ever get to the answer on what's going on here? I see that pretty much all "state of the art" networks usually use just simple SGD + momentum. Why is this so? Why do they not use Adam if it's so awesome..? Or is it? – Simanas Jun 28 '17 at 18:32
  • Please see my answer for a detailed explanation. – andre_bauer Jul 08 '17 at 09:47
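Regarding the shuffling comment above: a minimal TF1-style sketch of what that setting looks like in the (now deprecated) queue-based input pipeline. The filenames are placeholders, not the commenter's actual data.

```python
import tensorflow as tf

# With the TF1 queue API, shuffle=True on string_input_producer feeds the
# filenames in random order instead of a fixed "nice" sequence.
filenames = ["train_000.tfrecords", "train_001.tfrecords"]  # illustrative names
filename_queue = tf.train.string_input_producer(filenames, shuffle=True, seed=42)

# Read serialized examples from the shuffled filename queue.
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)
```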

1 Answer

20

My experience over the last months is the following: Adam is very easy to use because you don't have to play with the initial learning rate very much, and it almost always works. However, when coming close to convergence, Adam does not really settle on a solution but jiggles around at higher iteration counts, while SGD gives an almost perfectly shaped loss plot and seems to converge much better at higher iterations. But changing little parts of the setup requires adjusting the SGD parameters, or you will end up with NaNs... For experiments on architectures and general approaches I favor Adam, but if you want to get the best version of one chosen architecture you should use SGD and at least compare the solutions.

I also noticed that a good initial SGD setup (learning rate, weight decay, etc.) converges about as fast as Adam, at least for my setup. Hope this may help some of you!
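To illustrate the comparison, here is a rough sketch of swapping the optimizer while keeping the rest of the training graph unchanged. The learning rates and momentum are placeholder values, not a recommendation, and the dummy loss only stands in for the combined loss from the question.

```python
import tensorflow as tf

# Dummy loss so the snippet runs standalone; in practice this would be the
# sum of label loss and weight loss from the question above.
w = tf.Variable(tf.random_normal([10]))
total_loss = tf.nn.l2_loss(w)

use_adam = False  # flip to compare the two optimizers on the same setup

if use_adam:
    # Adam: little tuning of the initial learning rate needed.
    optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
else:
    # Plain SGD with momentum: usually needs a tuned learning rate (and
    # possibly a decay schedule), but tends to settle more cleanly.
    optimizer = tf.train.MomentumOptimizer(learning_rate=1e-2, momentum=0.9)

train_op = optimizer.minimize(total_loss)
```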

EDIT: Please note that the effects in my initial question are NOT normal, even with Adam. It seems I had a bug, but I can't really remember what the issue was.

andre_bauer
  • What you are seeing is an effect of the numerical instability of Adam and other adaptive stochastic gradient descent algorithms. It's a "known bug"; see https://discuss.pytorch.org/t/loss-suddenly-increases-using-adam-optimizer/11338 and https://openreview.net/forum?id=ryQu7f-RZ (and the sketch below). – Björn Lindqvist Jun 07 '20 at 19:54
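Following up on that comment, two mitigations commonly discussed in that thread and the linked paper ("On the Convergence of Adam and Beyond") are a larger epsilon and the AMSGrad variant. A hedged sketch with illustrative values:

```python
import tensorflow as tf

# 1) A larger epsilon in Adam. The TF1 AdamOptimizer docs note that the
#    default of 1e-8 is not always a good choice in practice.
optimizer = tf.train.AdamOptimizer(learning_rate=1e-3, epsilon=1e-4)

# 2) The AMSGrad variant proposed in the linked OpenReview paper; in newer
#    TF releases it is exposed via the Keras optimizer (uncomment if available):
# optimizer = tf.keras.optimizers.Adam(amsgrad=True)
```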