I am training a mixture density network, and after a while (57 epochs) I get an error about NaN values from `tf.add_check_numerics_ops()`.
The error message is:
    dense_1/kernel/read:0 : Tensor had NaN values
    [[Node: CheckNumerics_9 = CheckNumerics[T=DT_FLOAT, message="dense_1/kernel/read:0", _device="/job:localhost/replica:0/task:0/gpu:0"](dense_1/kernel/read, ^CheckNumerics_8)]]
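For context, the check is wired into the training loop roughly like this (a simplified sketch; `train_op`, `feed`, and `num_steps` are stand-ins for my actual training step and data):

```python
import tensorflow as tf

# ... build model, loss, and train_op ...

# Adds a CheckNumerics op for every floating-point tensor in the graph;
# the run call below raises as soon as any of them contains a NaN or Inf.
check_op = tf.add_check_numerics_ops()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(num_steps):
        sess.run([train_op, check_op], feed_dict=feed)
```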
If I check the weights of my `dense_1` layer using `layer.get_weights()`, I can see that none of them are NaN.
But when I try `sess.run([graph.get_tensor_by_name('dense_1/kernel/read:0')], feed_dict=stuff)` I get an array the size of my weights that is all NaNs.
I don't really understand what the read operation is doing. Is there some sort of caching that is having issues?
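For reference, here's roughly how I'm comparing the two (assuming Keras with the TensorFlow backend; `model` and `stuff` are from my setup):

```python
import numpy as np
from keras import backend as K

sess = K.get_session()
graph = sess.graph

# Weights fetched through Keras: no NaNs anywhere
weights, biases = model.get_layer('dense_1').get_weights()
print(np.isnan(weights).any())  # False

# The same variable fetched through its read op: all NaNs
read_val = sess.run(graph.get_tensor_by_name('dense_1/kernel/read:0'),
                    feed_dict=stuff)
print(np.isnan(read_val).any())  # True
```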
Details of the network:
(I've tried many combinations of these, and they all eventually produce NaNs, although at different epochs.)
- 3 hidden layers of sizes 32, 16, and 32
- non-linearity = selu (I've also tried tanh, relu, and elu)
- gradient clipping (see the sketch after this list)
- dropout
- happens with or without batchnorm
- validation error is still improving when I get NaNs
- input: 128 dimensions
- output: mixture of 3 beta distributions in each of 64 dimensions
- occurs with or without adversarial examples
- I clip values to [eps, 1 - eps] with eps = 1e-7
- I use the logsumexp trick for numerical stability (both are in the sketch after this list)
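To make the gradient-clipping, eps-clipping, and logsumexp points concrete, here's a simplified sketch of those pieces (the full version is in the gist below; the toy network and the softplus parameterization of the Beta parameters are illustrative placeholders, not necessarily my exact code):

```python
import tensorflow as tf

EPS = 1e-7

def beta_mixture_nll(y_true, alpha, beta, mix_logits):
    """Negative log-likelihood of y_true under a per-dimension
    mixture of Beta distributions, computed in log space."""
    # Clip targets into [eps, 1 - eps] so log(y) and log(1 - y) stay finite
    y = tf.clip_by_value(y_true, EPS, 1.0 - EPS)
    y = tf.expand_dims(y, -1)  # broadcast against the mixture components

    # log Beta pdf: (a-1) log y + (b-1) log(1-y) - log B(a, b)
    log_pdf = ((alpha - 1.0) * tf.log(y)
               + (beta - 1.0) * tf.log(1.0 - y)
               - tf.lgamma(alpha) - tf.lgamma(beta)
               + tf.lgamma(alpha + beta))

    # logsumexp over components instead of summing raw probabilities
    log_mix = tf.nn.log_softmax(mix_logits)
    log_lik = tf.reduce_logsumexp(log_mix + log_pdf, axis=-1)
    return -tf.reduce_mean(log_lik)

# Toy network standing in for the real one: 128-d input, 64-d output,
# 3 Beta components per output dimension
x = tf.placeholder(tf.float32, [None, 128])
y_true = tf.placeholder(tf.float32, [None, 64])
h = tf.layers.dense(x, 32, activation=tf.nn.selu)
# softplus keeps the Beta parameters positive
alpha = tf.nn.softplus(tf.reshape(tf.layers.dense(h, 64 * 3), [-1, 64, 3]))
beta = tf.nn.softplus(tf.reshape(tf.layers.dense(h, 64 * 3), [-1, 64, 3]))
mix_logits = tf.reshape(tf.layers.dense(h, 64 * 3), [-1, 64, 3])
loss = beta_mixture_nll(y_true, alpha, beta, mix_logits)

# Gradient clipping applied before the update step
opt = tf.train.AdamOptimizer(1e-3)
grads_and_vars = opt.compute_gradients(loss)
clipped = [(tf.clip_by_norm(g, 1.0), v)
           for g, v in grads_and_vars if g is not None]
train_op = opt.apply_gradients(clipped)
```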
Most of the relevant code can be found here:
https://gist.github.com/MarvinT/29bbeda2aecee17858e329745881cc7c