
I'm trying to build a simple multilayer perceptron model on a large data set, but I'm getting a loss value of NaN. The weird thing is: after the first training step, the loss value is not NaN, but about 46 (which is oddly low; when I run a logistic regression model, the first loss value is about ~3600). Right after that, though, the loss value is constantly NaN. I used tf.Print to try to debug it as well.

The goal of the model is to predict ~4500 different classes, so it's a classification problem. Using tf.Print, I see that after the first training step (i.e. the first forward pass through the MLP), the predictions coming out of the last fully connected layer seem right (varying numbers between 1 and 4500). After that, though, the outputs of the last fully connected layer go to either all 0's or some other constant number (0 0 0 0 0).

For some information about my model:

  • 3-layer model, all fully connected layers

  • batch size of 1000

  • learning rate of 0.001 (I also tried 0.1 and 0.01, but nothing changed)

  • using cross-entropy loss (I added an epsilon value to prevent log(0))

  • using AdamOptimizer

  • learning rate decay of 0.95 (a rough sketch of this training setup follows the list)
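
Roughly, the optimizer setup described above looks like the following TF1-style sketch (the loss tensor, the global step, and the decay_steps interval here are placeholders, not my exact code):

global_step = tf.train.get_or_create_global_step()
# decay the learning rate by 0.95 every decay_steps steps (placeholder interval)
learning_rate = tf.train.exponential_decay(0.001, global_step, decay_steps=10000, decay_rate=0.95)
optimizer = tf.train.AdamOptimizer(learning_rate)
# loss is the epsilon-stabilized cross-entropy mentioned above (not shown here)
train_op = optimizer.minimize(loss, global_step=global_step)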

The exact code for the model is below (I'm using the TF-Slim library):

input_layer = slim.fully_connected(model_input, 5000, activation_fn=tf.nn.relu)
hidden_layer = slim.fully_connected(input_layer, 5000, activation_fn=tf.nn.relu)
output = slim.fully_connected(hidden_layer, vocab_size, activation_fn=tf.nn.relu)
output = tf.Print(output, [tf.argmax(output, 1)], 'out = ', summarize = 20, first_n = 10)
return {"predictions": output}

Any help would be greatly appreciated! Thank you so much!

dooder

3 Answers


Two (possibly more) reasons why it doesn't work:

  1. You skipped or inappropriately applied feature scaling of your inputs and outputs, so the data may be difficult for TensorFlow to handle (a small sketch follows this list).
  2. Using ReLU may raise issues: it is unbounded above and its gradient is zero for negative inputs, so units can get stuck at zero or activations can grow very large. Try using other activation functions, such as tanh or sigmoid.
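
For point 1, a minimal sketch of what scaling the input features could look like, assuming they are available as NumPy arrays (the function and variable names here are illustrative, not taken from the question's code):

import numpy as np

def standardize(train_features, test_features):
    # Scale each feature column to zero mean and unit standard deviation,
    # using statistics computed on the training set only.
    mean = train_features.mean(axis=0)
    std = train_features.std(axis=0) + 1e-8  # guard against constant features
    return (train_features - mean) / std, (test_features - mean) / std
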
OZ13
  • Thanks so much for your comment! Changing the last layer to a sigmoid actually did fix the NaN loss error. Now, the loss value initially seems normal (roughly 3600), but then it quickly drops back down to ~23 or so, which is very bizarre. Would you be able to elaborate on your first point about feature scaling? Thank you! – dooder May 20 '17 at 01:58
  • Sure. Imagine a situation in which you have 2 features on different scales (e.g. house price x1: ~1e6 dollars and area x2: ~10-100 m2). When they are not scaled, you risk that your optimization algorithm will bounce back and forth on the "steepest" (hyper-)surface, which may even end up not converging (hence the possible NaN). For the given example, you can imagine the (x1, x2) surface having such a steep valley. Now, if you apply feature scaling (xi <- (xi - mean(xi)) / std(xi)), all the features operate on similar scales, centred around 0 with unit stddev. Do you see the point? – OZ13 May 21 '17 at 10:51
  • Oh ok, so it's meant as a way to compare features with different units. I'm not sure I understood what you meant by "when not scaled, you risk that your optimization algorithm will bounce back and forth on the steepest hyper-surface". Why would it bounce back and forth? I may be missing something – dooder May 22 '17 at 07:24
  • It has nothing to do with units. Let's say you have two features, so your loss function J(x1, x2) looks like a normal surface: it has hills and valleys. If you use an optimizer such as gradient descent, it will try to figure out the steepest way down and make a step there, to reduce it (min J(x1, x2)). What happens if x1 is of the order of millions and x2 of hundreds? Then x1 is going to change a LOT compared to x2, and that will create sharp valleys. Every step of -grad J can now "jump" across such a valley and likely not end up at the bottom (convergence issues, NaN...). – OZ13 May 22 '17 at 09:07
  • Oh, OK. I understand it now. So scaling my features would be the way to prevent that from happening. Thank you so much for your help! – dooder May 23 '17 at 04:30
  • It is not a guarantee of success, of course, but it greatly improves the chances. Feature scaling makes changes in each variable xi proportional to one another, making it considerably easier for the optimizer to converge. – OZ13 May 23 '17 at 07:59

For some reason, your training process has diverged, and you may have infinite values in your weights, which gives NaN losses. The reasons can be many; try changing your training parameters (use smaller batches as a test).

Also, using a ReLU for the last output in a classifier is not the usual method; try using a sigmoid.
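
As a rough sketch of that change, reusing the layer names from the question's snippet (the labels tensor is a placeholder, and the linear-output-plus-softmax variant is the other common option for single-label multi-class, not something from the question):

# Option suggested above: sigmoid on the final layer instead of ReLU.
output = slim.fully_connected(hidden_layer, vocab_size, activation_fn=tf.nn.sigmoid)

# Common alternative for single-label multi-class: leave the final layer linear
# and apply softmax cross-entropy directly on the logits ('labels' is a
# placeholder for the one-hot ground-truth tensor).
logits = slim.fully_connected(hidden_layer, vocab_size, activation_fn=None)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))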

  • Thank you so much for your help! Changing the last layer to a sigmoid actually did fix the NaN loss error. Despite that, all of the predicted classes are still [0 0 0 0 0 0 0], or some other constant like [5 5 5 5 5]. Just wondering, how do you know when your training process has diverged? I noticed that my loss values sometimes go up and down – dooder May 20 '17 at 01:56

From my understanding, ReLU doesn't put a cap on the upper bound of a neural network's activations, so it's more likely to diverge, depending on the implementation.

Try switching all the activation functions to tanh or sigmoid. ReLU is generally used for convolutions in CNNs.

It's also difficult to determine whether you're diverging because of the cross-entropy, as we don't know how you affected it with your epsilon value. Try just using the residual; it's much simpler but still effective.

Also, a 5000-5000-4500 neural network is huge. It's unlikely you actually need a network that large.
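
As a rough sketch of these suggestions combined, reusing model_input and vocab_size from the question's snippet (the 800-unit hidden size is only an illustrative starting point, not derived from the question):

# Sketch: a smaller network with bounded activations, per the suggestions above.
hidden_1 = slim.fully_connected(model_input, 800, activation_fn=tf.nn.tanh)
hidden_2 = slim.fully_connected(hidden_1, 800, activation_fn=tf.nn.tanh)
output = slim.fully_connected(hidden_2, vocab_size, activation_fn=tf.nn.sigmoid)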

Jjoseph
  • Thanks for your comment! I will change the activation functions. As for the number of neurons, I had 5000 because there are ~4500 output classes, so my belief was that there should be more neurons than output classes. Or am I missing something? Thanks! – dooder May 22 '17 at 07:20
  • A lot of people recommend having your hidden layer node count be somewhere between the number of inputs and the number of outputs. As just an anecdotal point, on the MNIST dataset using a standard feed-forward model I was able to get an accuracy of 91% using only 8 hidden nodes (784 inputs, 8 hidden nodes, 10 outputs), but I wasn't able to get above 97% accuracy until I created a network with about 280 hidden nodes. I don't believe the number of nodes needed scales linearly, however, because the number of connections grows with the product of adjacent layer sizes, i.e. num_weights = layer_n * layer_n+1. – Jjoseph May 22 '17 at 22:22
  • There was a research paper, I think by Alex Graves, that states that increasing the number of neurons helps with training, but once converged you generally only need a small fraction of the number of nodes originally trained. I don't recall the source right now, but I'll see if I can dig it up and link it when I can. I would try a network as small as 800 hidden nodes and start increasing from there if your accuracy is still too low. Is your net still not converging? – Jjoseph May 22 '17 at 22:26
  • Thank you so much for your comments! It is converging now. The loss value gradually goes down but sometimes randomly spikes up and down - I assumed that's just part of how a NN behaves. I understand your comments about the number of nodes; they helped a lot! In this problem, I have roughly ~1000 input features and ~4500 possible output classes. I've been trying 2000 hidden neurons with 1 and 2 hidden layers, 3000 hidden neurons, and 5000 hidden neurons. None of them seem to perform as well as a simple logistic regression model I have. Theoretically, an MLP should be better though, right? – dooder May 23 '17 at 04:54
  • Assuming it's a nonlinear function you're solving for, yes. The occasional spike is normal. How many iterations over the training set are you doing? It's quite common to have to do 100+ iterations even when the learning rate is set as high as 0.3. Also, if your learning rate is too low, it may take many, many iterations to converge to a decent minimum. Likewise, if your learning rate is too high, it may be causing the spikes in error (and may increase the chance of divergence). – Jjoseph May 23 '17 at 12:55