I am training a mixture density network, and after a while (57 epochs) I get an error about NaN values from `tf.add_check_numerics_ops()`.
The error message is:
    dense_1/kernel/read:0 : Tensor had NaN values
    [[Node: CheckNumerics_9 = CheckNumerics[T=DT_FLOAT, message="dense_1/kernel/read:0", _device="/job:localhost/replica:0/task:0/gpu:0"](dense_1/kernel/read, ^CheckNumerics_8)]]
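For context, the check is wired into the training loop roughly like this (a simplified sketch; `train_op`, `feed`, and `num_steps` are stand-ins for my actual training step and data):

```python
import tensorflow as tf

# ... build model, loss, and train_op ...

# Adds a CheckNumerics op for every floating-point tensor in the graph;
# the run call below raises as soon as any of them contains a NaN or Inf.
check_op = tf.add_check_numerics_ops()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(num_steps):
        sess.run([train_op, check_op], feed_dict=feed)
```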
If I check the weights of my `dense_1` layer using `layer.get_weights()`, I can see that none of them are NaN.
But when I try `sess.run([graph.get_tensor_by_name('dense_1/kernel/read:0')], feed_dict=stuff)` I get an array the size of my weights that is all NaNs.
I don't really understand what the read operation is doing. Is there some sort of caching that is having issues?
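For reference, here's roughly how I'm comparing the two (assuming Keras with the TensorFlow backend; `model` and `stuff` are from my setup):

```python
import numpy as np
from keras import backend as K

sess = K.get_session()
graph = sess.graph

# Weights fetched through Keras: no NaNs anywhere
weights, biases = model.get_layer('dense_1').get_weights()
print(np.isnan(weights).any())  # False

# The same variable fetched through its read op: all NaNs
read_val = sess.run(graph.get_tensor_by_name('dense_1/kernel/read:0'),
                    feed_dict=stuff)
print(np.isnan(read_val).any())  # True
```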
Details of the network:
(I've tried many combinations of these, and they all eventually produce NaNs, although at different epochs.)
- 3 hidden layers of sizes 32, 16, and 32
- non-linearity = selu (I've also tried tanh, relu, and elu)
- gradient clipping (see the sketch after this list)
- dropout
- happens with or without batchnorm
- validation error is still improving when I get NaNs
- input: 128 dimensions
- output: mixture of 3 beta distributions in each of 64 dimensions
- occurs with or without adversarial examples
- I clip values to [eps, 1 - eps] with eps = 1e-7
- I use the logsumexp trick for numerical stability (both are in the sketch after this list)
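To make the gradient-clipping, eps-clipping, and logsumexp points concrete, here's a simplified sketch of those pieces (the full version is in the gist below; the toy network and the softplus parameterization of the Beta parameters are illustrative placeholders, not necessarily my exact code):

```python
import tensorflow as tf

EPS = 1e-7

def beta_mixture_nll(y_true, alpha, beta, mix_logits):
    """Negative log-likelihood of y_true under a per-dimension
    mixture of Beta distributions, computed in log space."""
    # Clip targets into [eps, 1 - eps] so log(y) and log(1 - y) stay finite
    y = tf.clip_by_value(y_true, EPS, 1.0 - EPS)
    y = tf.expand_dims(y, -1)  # broadcast against the mixture components

    # log Beta pdf: (a-1) log y + (b-1) log(1-y) - log B(a, b)
    log_pdf = ((alpha - 1.0) * tf.log(y)
               + (beta - 1.0) * tf.log(1.0 - y)
               - tf.lgamma(alpha) - tf.lgamma(beta)
               + tf.lgamma(alpha + beta))

    # logsumexp over components instead of summing raw probabilities
    log_mix = tf.nn.log_softmax(mix_logits)
    log_lik = tf.reduce_logsumexp(log_mix + log_pdf, axis=-1)
    return -tf.reduce_mean(log_lik)

# Toy network standing in for the real one: 128-d input, 64-d output,
# 3 Beta components per output dimension
x = tf.placeholder(tf.float32, [None, 128])
y_true = tf.placeholder(tf.float32, [None, 64])
h = tf.layers.dense(x, 32, activation=tf.nn.selu)
# softplus keeps the Beta parameters positive
alpha = tf.nn.softplus(tf.reshape(tf.layers.dense(h, 64 * 3), [-1, 64, 3]))
beta = tf.nn.softplus(tf.reshape(tf.layers.dense(h, 64 * 3), [-1, 64, 3]))
mix_logits = tf.reshape(tf.layers.dense(h, 64 * 3), [-1, 64, 3])
loss = beta_mixture_nll(y_true, alpha, beta, mix_logits)

# Gradient clipping applied before the update step
opt = tf.train.AdamOptimizer(1e-3)
grads_and_vars = opt.compute_gradients(loss)
clipped = [(tf.clip_by_norm(g, 1.0), v)
           for g, v in grads_and_vars if g is not None]
train_op = opt.apply_gradients(clipped)
```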
Most of the relevant code can be found here:
https://gist.github.com/MarvinT/29bbeda2aecee17858e329745881cc7c