Loss in TensorFlow suddenly turns into NaN

Question

When I train with TensorFlow, the loss suddenly turns into NaN, like this:

Epoch:  00001 || cost= 0.675003929
Epoch:  00002 || cost= 0.237375346
Epoch:  00003 || cost= 0.204962473
Epoch:  00004 || cost= 0.191322120
Epoch:  00005 || cost= 0.181427178
Epoch:  00006 || cost= 0.172107664
Epoch:  00007 || cost= 0.171604740
Epoch:  00008 || cost= 0.160334495
Epoch:  00009 || cost= 0.151639721
Epoch:  00010 || cost= 0.149983061
Epoch:  00011 || cost= 0.145890004
Epoch:  00012 || cost= 0.141182279
Epoch:  00013 || cost= 0.140914166
Epoch:  00014 || cost= 0.136189088
Epoch:  00015 || cost= 0.133215346
Epoch:  00016 || cost= 0.130046664
Epoch:  00017 || cost= 0.128267926
Epoch:  00018 || cost= 0.125328618
Epoch:  00019 || cost= 0.125053261
Epoch:  00020 || cost= nan
Epoch:  00021 || cost= nan
Epoch:  00022 || cost= nan
Epoch:  00023 || cost= nan
Epoch:  00024 || cost= nan
Epoch:  00025 || cost= nan
Epoch:  00026 || cost= nan
Epoch:  00027 || cost= nan

And the main training code is:

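for epoch in range(1000):
    Mcost = 0  # running sum of per-batch costs for this epoch

    for i in range(total_batch):
        # Slice out the i-th mini-batch
        batch_X = X[i*batch_size:(i+1)*batch_size]
        batch_Y = Y[i*batch_size:(i+1)*batch_size]
        # Run one training step and fetch the batch cost and predictions
        _, c, pY = sess.run([train, cost, y_conv], feed_dict={x: batch_X, y_: batch_Y, keep_prob: 0.8})
        Mcost = Mcost + c

    # Report the mean cost over all batches of the epoch
    print("Epoch: ", '%05d' % (epoch+1), "|| cost=", '{:.9f}'.format(Mcost/total_batch))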

Since the cost is fine for the first 19 epochs, I believe the network and the input are OK. The network has 4 CNN layers with ReLU activations, and the last layer is fully connected with no activation function.

I also know that 0/0 or log(0) results in NaN. But my loss function is:

c1 = y_conv - y_
c2 = tf.square(c1)
c3 = tf.reduce_sum(c2,1)
c4 = tf.sqrt(c3)
cost = tf.reduce_mean(c4)

I run TensorFlow on a GTX 1080 GPU.

Any suggestions are appreciated.

P-Gn · Accepted Answer · 2017-06-12T20:07:48.053

Quite often, those NaNs come from a divergence in the optimization due to growing gradients. They usually don't appear all at once, but rather after a phase where the loss increases suddenly and reaches inf within a few steps. You probably don't see this explosive increase because you check your loss only once per epoch. Try displaying the loss every step or every few steps and you are likely to see this phenomenon.
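To make the per-step check concrete, here is a minimal sketch in plain Python (no TensorFlow needed). The helper name first_nonfinite_step and the sample loss values are made up for illustration; in the question's loop, the value to inspect would be c, the per-batch cost returned by sess.run.

```python
import math

def first_nonfinite_step(step_losses):
    """Return the index of the first loss that is NaN or inf, or None if all are finite."""
    for step, loss in enumerate(step_losses):
        if not math.isfinite(loss):
            return step
    return None

# Hypothetical per-step losses: stable at first, then a blow-up
# that an epoch-level average would hide behind a single "nan".
losses = [0.125, 0.124, 0.9, 35.0, 1.2e6, float("inf"), float("nan")]

bad_step = first_nonfinite_step(losses)
print("first non-finite loss at step:", bad_step)
```

Stopping at that step (and printing the few losses before it) usually shows whether the NaN was preceded by a rapid increase.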

As to why your gradient explodes suddenly, I would suggest you try without tf.sqrt in your loss function. This should be more numerically stable. tf.sqrt has the bad property that its gradient explodes near zero. This means an increasing risk of divergence once you get close to a solution, which looks a lot like what you are observing.
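The behaviour near zero follows from the derivative d sqrt(s)/ds = 1 / (2 sqrt(s)). The sketch below evaluates it in plain Python for a few values of s, where s stands for c3 in the question (the per-example sum of squared errors). The helper name sqrt_grad is only for illustration. The last lines show how an infinite factor multiplied by a zero factor gives NaN in floating point, which is one plausible way the chain rule could produce NaN here; I have not verified that this is exactly what happens in the asker's graph.

```python
import math

def sqrt_grad(s):
    """Derivative of sqrt(s) with respect to s, valid for s > 0."""
    return 0.5 / math.sqrt(s)

# The closer s gets to zero, the larger the derivative becomes.
for s in (1.0, 1e-4, 1e-8, 1e-12):
    print(s, sqrt_grad(s))

# In the limit s -> 0 the derivative is infinite, while the derivative of
# c1**2 with respect to c1 is 2*c1 = 0 at c1 == 0. In IEEE floating point,
# inf * 0 is NaN.
outer = float("inf")  # limit of d sqrt(s)/ds as s -> 0
inner = 2 * 0.0       # d(c1**2)/dc1 evaluated at c1 == 0
product = outer * inner
print(math.isnan(product))
```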

Thank you for your suggestion. tf.sqrt may be the problem. However, when I remove tf.sqrt, the loss decreases very slowly, and I don't know why. I then changed c2 = tf.square(c1) to c2 = tf.square(c1) + 1, and the loss now decreases fine. — Qiang Zhang, Jun 14 '17 at 04:28
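A possible reading of why the offset helps: adding a positive constant to each squared term keeps the argument of the square root away from zero, so the gradient stays finite even when the prediction is exact. The plain-Python sketch below illustrates this for one output row. The function name row_norm_grad and the parameter eps are hypothetical; eps=1.0 mirrors the "+ 1" from the comment. Note that the offset also changes the value of the loss, so it is no longer exactly the Euclidean distance.

```python
import math

def row_norm_grad(row, eps=0.0):
    """Gradient of sqrt(sum(r*r + eps for r in row)) with respect to each r."""
    norm = math.sqrt(sum(r * r + eps for r in row))
    return [r / norm for r in row]

# Away from zero, the gradient is well defined with or without the offset.
print(row_norm_grad([3.0, 4.0]))

# With a perfect prediction (all residuals zero), the plain version divides
# by zero; Python raises here, whereas floating-point tensor math would
# typically yield NaN instead.
try:
    row_norm_grad([0.0, 0.0])
except ZeroDivisionError:
    print("undefined gradient at zero residual")

# With the offset, the same point has a finite (zero) gradient.
print(row_norm_grad([0.0, 0.0], eps=1.0))
```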
