
I'm getting started with TensorFlow. https://www.tensorflow.org/get_started/

While evaluating multiple times to see how to feed the data, I found that the loss changes between executions.

eval_input_fn = tf.contrib.learn.io.numpy_input_fn(
    {"x": x}, y, batch_size=4, num_epochs=1)
estimator.evaluate(input_fn=eval_input_fn)

For example, I got losses like the following:

0.024675447 or 0.030844312 when batch_size == 2, num_epochs == 2

0.020562874 or 0.030844312 when batch_size == 4, num_epochs == 2

0.015422156 or 0.030844312 when batch_size == 4, num_epochs == 1

Is this phenomenon normal? I do not understand the principle behind it.

--- the following was added later

The same thing happens when I use `next_batch` and `eval()` without retraining, as in https://www.tensorflow.org/get_started/mnist/pros. When I run the following cell:

# mnist.test.labels.shape: (10000, 10)
for i in range(10):
    batch = mnist.test.next_batch(1000)
    print("test accuracy %g"%accuracy.eval(feed_dict={
        x: batch[0], y_: batch[1], keep_prob: 1.0}))

I got

a)

test accuracy 0.99
test accuracy 0.997
test accuracy 0.986
test accuracy 0.993
test accuracy 0.994
test accuracy 0.993
test accuracy 0.995
test accuracy 0.995
test accuracy 0.99
test accuracy 0.99

b)

test accuracy 0.99
test accuracy 0.997
test accuracy 0.989
test accuracy 0.992
test accuracy 0.993
test accuracy 0.992
test accuracy 0.994
test accuracy 0.993
test accuracy 0.993
test accuracy 0.99

and they (and their average) keep changing.

1 Answer


This is completely normal, and it is even exploited in many papers.

The first thing to note is that you start from randomly initialized weights. If you train many times you'll find a mean and a variance to your results, often several points of accuracy apart on common classification problems, so it's very normal to train multiple times and pick the best result.

Be aware, though, that doing so overfits your model selection to the test data: you're picking the run that may simply have gotten lucky on that particular test set, with no guarantee that it generalizes equally well to unseen data. This is why you use separate train/validation/test sets: train on the training data, tune parameters on the validation data across many training runs, and report results only on the test data, which should not be used in more than one final evaluation.
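To make that workflow concrete, here is a minimal NumPy-only sketch (my own toy example, not the asker's setup): a small linear model is trained from several random initializations, the best run is selected using only the validation split, and the held-out test split is scored exactly once at the end.

# A minimal sketch (not from the question) of the train/validation/test
# workflow described above, using a toy 1-D linear model.
import numpy as np

rng = np.random.RandomState(0)

# Toy data: y = 3x - 2 plus noise.
X = rng.uniform(-1, 1, size=(300, 1))
y = 3.0 * X[:, 0] - 2.0 + 0.1 * rng.randn(300)

# 60/20/20 split into train / validation / test.
idx = rng.permutation(len(X))
train_idx, val_idx, test_idx = idx[:180], idx[180:240], idx[240:]

def train(seed, lr=0.1, steps=200):
    """Train w*x + b from randomly initialized weights (seed controls the init)."""
    r = np.random.RandomState(seed)
    w, b = r.randn(), r.randn()
    for _ in range(steps):
        pred = w * X[train_idx, 0] + b
        err = pred - y[train_idx]
        w -= lr * np.mean(err * X[train_idx, 0])
        b -= lr * np.mean(err)
    return w, b

def mse(w, b, subset):
    return float(np.mean((w * X[subset, 0] + b - y[subset]) ** 2))

# Train several times, select on the validation set only.
runs = [train(seed) for seed in range(5)]
val_scores = [mse(w, b, val_idx) for w, b in runs]
best_w, best_b = runs[int(np.argmin(val_scores))]

# The test set is touched exactly once, for the final report.
print("validation MSEs:", np.round(val_scores, 4))
print("test MSE of selected run:", round(mse(best_w, best_b, test_idx), 4))

The names `train` and `mse` are hypothetical helpers for this toy model; the point is only the selection logic, which carries over unchanged to a TensorFlow estimator.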

You also noted differences with varying batch sizes. In my own experimentation I've found that batch size acts as a regularizer. When I had lots of data and no overfitting issues, the best results came from large batch sizes. When I had little data and more need to regularize, smaller batch sizes tended to produce better results. The reason: smaller batch sizes inject more randomness into the optimization process, making it easier to escape local minima, while larger batch sizes do a better job of approximating the true gradient (each step is more likely to move in the right direction).
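As a quick numerical illustration of that last point (a standalone NumPy sketch, not tied to the question's model), you can check how much mini-batch averages scatter around the full-data average for different batch sizes:

# Illustration only: pretend these are per-example gradients for 10,000 examples.
import numpy as np

rng = np.random.RandomState(42)
per_example_grads = rng.randn(10000) + 0.5
true_grad = per_example_grads.mean()

for batch_size in (2, 4, 32, 256):
    # Draw many mini-batches and measure how far their means scatter
    # around the full-data mean.
    estimates = [
        per_example_grads[rng.choice(len(per_example_grads), batch_size)].mean()
        for _ in range(1000)
    ]
    spread = np.std(np.array(estimates) - true_grad)
    print("batch_size=%4d  std of gradient estimate: %.4f" % (batch_size, spread))

The spread shrinks as the batch grows, which is exactly the extra randomness (or lack of it) described above.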

One way this issue is exploited: you can find academic papers describing voting techniques, where a neural network is trained many times and each trained copy casts a single vote; these ensembles have often done very well. Taking it a step further, you might deliberately choose networks that make different mistakes from one another to get the best ensemble.
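Here is a hedged sketch of the voting idea in plain NumPy; the "models" are just hypothetical arrays of predicted labels standing in for separately trained networks evaluated on the same examples:

# Majority voting over several classifiers' predictions (toy stand-in data).
import numpy as np

def majority_vote(predictions):
    """predictions: shape (n_models, n_examples) of class labels.
    Returns the most common label per example."""
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # Count votes per class for each example, then take the winner.
    votes = np.apply_along_axis(
        lambda col: np.bincount(col, minlength=n_classes), 0, predictions)
    return votes.argmax(axis=0)

# Three hypothetical models that each make a different mistake on 6 examples.
true_labels = np.array([0, 1, 2, 1, 0, 2])
model_preds = np.array([
    [0, 1, 2, 1, 0, 1],   # wrong on example 5
    [0, 1, 2, 0, 0, 2],   # wrong on example 3
    [0, 2, 2, 1, 0, 2],   # wrong on example 1
])

ensemble = majority_vote(model_preds)
for name, preds in [("model 0", model_preds[0]),
                    ("model 1", model_preds[1]),
                    ("model 2", model_preds[2]),
                    ("ensemble", ensemble)]:
    print(name, "accuracy:", np.mean(preds == true_labels))

Because the three models err on different examples, the majority vote corrects each individual mistake, which is why choosing networks with diverse errors pays off.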

Note that among the best results on the MNIST handwritten digit dataset listed on LeCun's page is a committee of 35 convolutional neural networks combined by voting.

http://yann.lecun.com/exdb/mnist/

David Parks
  • Thank you for your answer, but I think there is a misunderstanding. I separated the evaluation phase from the fitting phase, so training is executed only once. Only this line `estimator.evaluate(input_fn = eval_input_fn)` is executed multiple times, and I still get different losses whatever batch size or number of epochs I pass. Actually, I don't understand why batch size and number of epochs matter for evaluation at all. – Change-the-world May 05 '17 at 04:53