
I am training a mixture density network, and after a while (57 epochs) tf.add_check_numerics_ops() raises an error about NaN values.
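
Roughly how the check is wired into my training loop (a sketch only; train_op and stuff are placeholders for my real training op and feed dict, not the exact names in my code):

    import tensorflow as tf

    # `train_op` and `stuff` are placeholders for my actual training op and feed dict
    check_op = tf.add_check_numerics_ops()  # one CheckNumerics op per float tensor in the graph

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # the "Tensor had NaN values" error below is raised from this run call
        sess.run([train_op, check_op], feed_dict=stuff)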

The error message is:

dense_1/kernel/read:0 : Tensor had NaN values
 [[Node: CheckNumerics_9 = CheckNumerics[T=DT_FLOAT, message="dense_1/kernel/read:0", _device="/job:localhost/replica:0/task:0/gpu:0"](dense_1/kernel/read, ^CheckNumerics_8)]]

If I check the weights of my dense_1 layer using layer.get_weights(), I can see that none of them are NaN.

When I try sess.run([graph.get_tensor_by_name('dense_1/kernel/read:0')], feed_dict=stuff) I get an array the size of my weights that is just NaNs.
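
To be explicit, the two checks look roughly like this (a sketch; model, sess, graph and stuff are placeholders for my actual Keras model, session, graph and feed dict):

    import numpy as np

    # placeholders: `model`, `sess`, `graph`, `stuff` stand in for my actual
    # Keras model, session, graph and feed dict
    kernel = model.get_layer('dense_1').get_weights()[0]    # kernel as Keras sees it
    print(np.isnan(kernel).any())                           # -> False

    read_t = graph.get_tensor_by_name('dense_1/kernel/read:0')
    read_vals = sess.run(read_t, feed_dict=stuff)           # same shape as the kernel
    print(np.isnan(read_vals).any())                        # -> True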

I don't really understand what the read operation is doing. Is there some sort of caching that is having issues?

Details of the network:

(I've tried many combinations of these, and they all eventually run into NaNs, although at different epochs.)

  • 3 hidden layers with 32, 16, and 32 units
  • non-linearity = selu, but I've also tried tanh, relu, and elu
  • gradient clipping
  • dropout
  • happens with or without batchnorm
  • validation error is still improving when I get NaNs
  • input: 128 dimensions
  • output: mixture of 3 beta distributions in each of 64 dimensions
  • occurs with or without adversarial examples
  • I clip values to [eps, 1 - eps] with eps = 1e-7
  • I use the logsumexp trick for numerical stability (see the sketch after this list)
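
Roughly what the clipping and logsumexp bullets mean, as a sketch (y, alpha, beta and log_mix_weights are illustrative names, not the exact ones in my gist):

    import tensorflow as tf

    eps = 1e-7
    # illustrative shapes: y is [batch, 64] with targets in [0, 1];
    # alpha, beta and log_mix_weights are [batch, 64, 3] (3 components per dimension)
    y_clipped = tf.clip_by_value(y, eps, 1.0 - eps)  # keep targets strictly inside (0, 1)

    # log-density of each beta component
    log_beta_pdf = ((alpha - 1.0) * tf.log(y_clipped[..., None])
                    + (beta - 1.0) * tf.log(1.0 - y_clipped[..., None])
                    - tf.lbeta(tf.stack([alpha, beta], axis=-1)))

    # logsumexp over the 3 components instead of summing raw densities
    log_likelihood = tf.reduce_logsumexp(log_mix_weights + log_beta_pdf, axis=-1)
    loss = -tf.reduce_mean(log_likelihood)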

Most of the relevant code can be found here:

https://gist.github.com/MarvinT/29bbeda2aecee17858e329745881cc7c


1 Answer


Caused by this unsolved bug in TensorFlow:

https://github.com/tensorflow/tensorflow/issues/2288

I still don't know where the NaN is getting into my gradient though...
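
One way I might try to narrow it down is checking each gradient individually instead of every tensor in the graph (sketch only; loss and optimizer are placeholders for the ones in my training code):

    import tensorflow as tf

    # `loss` and `optimizer` are placeholders for my real loss and optimizer
    checked_grads_and_vars = []
    for grad, var in optimizer.compute_gradients(loss):
        if isinstance(grad, tf.Tensor):  # skip None / IndexedSlices gradients
            grad = tf.check_numerics(grad, 'bad gradient for ' + var.name)
        checked_grads_and_vars.append((grad, var))
    train_op = optimizer.apply_gradients(checked_grads_and_vars)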
