I'm training an RNN, and sometime overnight the loss reached NaN. I've read that one fix for this is to decrease the learning rate, but when I restart training from the (only) checkpoint I have with a smaller learning rate, I still get NaN. Does this mean my checkpoint is beyond repair? Is there a way to either recover this checkpoint, or to use tf.train.Saver in such a way that I'm guaranteed a version of the model from before it reached the point of no return?
If learning rate was your issue, I would expect to see NaN from the very first epoch, not after many iterations. – Ian Ash May 07 '17 at 17:27
1 Answer
If your checkpoint has NaN values in it, there is probably not a lot you can do to recover it. I guess you could replace the NaNs with something else, but that isn't a very principled fix.
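If it helps, here is a minimal sketch of how you could check whether a checkpoint actually contains non-finite values, using tf.train.NewCheckpointReader (the checkpoint prefix 'checkpoints/model-12345' is just a placeholder for your own files):

```python
import numpy as np
import tensorflow as tf

# Placeholder checkpoint prefix; point this at your own checkpoint.
reader = tf.train.NewCheckpointReader('checkpoints/model-12345')

# Walk every variable stored in the checkpoint and report any NaN/Inf.
for name in reader.get_variable_to_shape_map():
    values = reader.get_tensor(name)
    if not np.all(np.isfinite(values)):
        print('non-finite values in variable:', name)
```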
You probably want to see if there is an earlier checkpoint without NaN values. tf.train.Saver keeps up to 5 previous checkpoints by default, for precisely this sort of reason:
https://www.tensorflow.org/api_docs/python/tf/train/Saver
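For example, a rough sketch of a training loop (TF 1.x style; train_op, loss, the step count, and the checkpoint directory are placeholders standing in for your own graph) that keeps more checkpoints around and skips saving once the loss goes non-finite:

```python
import numpy as np
import tensorflow as tf

# Keep the 20 most recent checkpoints instead of the default 5, and
# additionally keep one long-lived checkpoint every 2 hours.
saver = tf.train.Saver(max_to_keep=20, keep_checkpoint_every_n_hours=2)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100000):  # placeholder for your training loop length
        _, loss_value = sess.run([train_op, loss])  # train_op/loss from your graph
        # Skip checkpointing once the loss has blown up, so the most
        # recently saved files still predate the NaNs.
        if step % 1000 == 0 and np.isfinite(loss_value):
            saver.save(sess, 'checkpoints/model', global_step=step)
```

Gating the save on the loss value isn't a perfect guarantee (weights could in principle go bad after the last finite loss), but combined with keeping more checkpoints it usually leaves you at least one usable model to roll back to.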
Hope this helps!

Peter Hawkins