I'm training an RNN, and sometime overnight the loss reached NaN. I've read that one fix for this is to decrease the learning rate, but when I restart training from the (only) checkpoint I have with a smaller learning rate, I still get NaN. Does this mean my checkpoint is beyond repair? Is there a way to either recover this checkpoint, or to use tf.train.Saver in such a way that I'm guaranteed a version of the model from before it reached the point of no return?
If learning rate was your issue, I would expect to see NaN from the very first epoch, not after many iterations. – Ian Ash May 07 '17 at 17:27
1 Answer
If your checkpoint has NaN values in it, there is probably not a lot you can do to recover it. I guess you could replace the NaNs with something else, but that isn't a very principled fix.
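If it helps, here is a minimal sketch of how you could check whether a checkpoint actually contains non-finite values, using tf.train.NewCheckpointReader (the checkpoint prefix 'checkpoints/model-12345' is just a placeholder for your own files):

```python
import numpy as np
import tensorflow as tf

# Placeholder checkpoint prefix; point this at your own checkpoint.
reader = tf.train.NewCheckpointReader('checkpoints/model-12345')

# Walk every variable stored in the checkpoint and report any NaN/Inf.
for name in reader.get_variable_to_shape_map():
    values = reader.get_tensor(name)
    if not np.all(np.isfinite(values)):
        print('non-finite values in variable:', name)
```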
You probably want to see if there is an earlier checkpoint without NaN values. tf.train.Saver keeps up to 5 previous checkpoints by default, for precisely this sort of reason:
https://www.tensorflow.org/api_docs/python/tf/train/Saver
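For example, a rough sketch of a training loop (TF 1.x style; train_op, loss, the step count, and the checkpoint directory are placeholders standing in for your own graph) that keeps more checkpoints around and skips saving once the loss goes non-finite:

```python
import numpy as np
import tensorflow as tf

# Keep the 20 most recent checkpoints instead of the default 5, and
# additionally keep one long-lived checkpoint every 2 hours.
saver = tf.train.Saver(max_to_keep=20, keep_checkpoint_every_n_hours=2)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100000):  # placeholder for your training loop length
        _, loss_value = sess.run([train_op, loss])  # train_op/loss from your graph
        # Skip checkpointing once the loss has blown up, so the most
        # recently saved files still predate the NaNs.
        if step % 1000 == 0 and np.isfinite(loss_value):
            saver.save(sess, 'checkpoints/model', global_step=step)
```

Gating the save on the loss value isn't a perfect guarantee (weights could in principle go bad after the last finite loss), but combined with keeping more checkpoints it usually leaves you at least one usable model to roll back to.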
Hope this helps!

Peter Hawkins