
This is TensorFlow 2.3.0. During training, the reported values for SparseCategoricalCrossentropy loss and sparse_categorical_accuracy seemed way off. I looked through my code but couldn't spot any errors. Here's the code to reproduce:

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

x = np.random.randint(0, 255, size=(64, 224, 224, 3)).astype('float32')
y = np.random.randint(0, 3, (64, 1)).astype('int32')

ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(32)

def create_model():
  input_layer = tf.keras.layers.Input(shape=(224, 224, 3), name='img_input')
  x = tf.keras.layers.experimental.preprocessing.Rescaling(1./255, name='rescale_1_over_255')(input_layer)

  base_model = tf.keras.applications.ResNet50(input_tensor=x, weights='imagenet', include_top=False)

  x = tf.keras.layers.GlobalAveragePooling2D(name='global_avg_pool_2d')(base_model.output)

  output = Dense(3, activation='softmax', name='predictions')(x)

  return tf.keras.models.Model(inputs=input_layer, outputs=output)

model = create_model()

model.compile(
  optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
  loss=tf.keras.losses.SparseCategoricalCrossentropy(), 
  metrics=['sparse_categorical_accuracy']
)

model.fit(ds, steps_per_epoch=2, epochs=5)

This is what was printed:

Epoch 1/5
2/2 [==============================] - 0s 91ms/step - loss: 1.5160 - sparse_categorical_accuracy: 0.2969
Epoch 2/5
2/2 [==============================] - 0s 85ms/step - loss: 0.0892 - sparse_categorical_accuracy: 1.0000
Epoch 3/5
2/2 [==============================] - 0s 84ms/step - loss: 0.0230 - sparse_categorical_accuracy: 1.0000
Epoch 4/5
2/2 [==============================] - 0s 82ms/step - loss: 0.0109 - sparse_categorical_accuracy: 1.0000
Epoch 5/5
2/2 [==============================] - 0s 82ms/step - loss: 0.0065 - sparse_categorical_accuracy: 1.0000

But if I double-check with model.evaluate, and by "manually" computing the accuracy:

model.evaluate(ds)

2/2 [==============================] - 0s 25ms/step - loss: 1.2681 - sparse_categorical_accuracy: 0.2188
[1.268101453781128, 0.21875]

y_pred = model.predict(ds)
y_pred = np.argmax(y_pred, axis=-1)
y_pred = y_pred.reshape(-1, 1)
np.sum(y == y_pred)/len(y)

0.21875

The result from model.evaluate(...) agrees with the "manual" check on the metrics. But if you look at the loss/metrics from training, they look way off. It is rather hard to see what's wrong, since no error or exception is ever thrown.

Additionally, I created a very simple case to try to reproduce this, but it actually is not reproducible there. Note that batch_size equals the length of the data, so this isn't mini-batch GD but full-batch GD (to eliminate confusion with mini-batch loss/metrics):

x = np.random.randn(1024, 1).astype('float32')
y = np.random.randint(0, 3, (1024, 1)).astype('int32')
ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(1024)
model = Sequential()
model.add(Dense(3, activation='softmax'))
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(), 
    metrics=['sparse_categorical_accuracy']
)
model.fit(ds, epochs=5)
model.evaluate(ds)

As mentioned in my comment, one suspect is the batch norm layers, which the case that doesn't reproduce the issue doesn't have.
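
A quick sketch to check that suspicion (reusing the model built by create_model() above, nothing new introduced): count how many BatchNormalization layers the graph contains, since those are the layers that behave differently between training and inference, while the small Dense-only model below has none.

bn_layers = [l for l in model.layers
             if isinstance(l, tf.keras.layers.BatchNormalization)]  # ResNet50 contains many
print(len(bn_layers), 'BatchNormalization layers')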

  • what `tf.__version__` is this? – Nicolas Gervais Oct 17 '20 at 22:04
  • I think I can reproduce this with https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/vision/ipynb/image_classification_from_scratch.ipynb; I removed dropout and data_aug to avoid potential confusion. The loss/metrics in that notebook are also different. I begin to wonder if there's a benign, straightforward explanation – kawingkelvin Oct 17 '20 at 22:05
  • @NicolasGervais 2.3.0, I did this on Google Colab – kawingkelvin Oct 17 '20 at 22:06
  • I am getting a suspicion this has something to do with the presence of batch norm layers in the model. I think they behave differently depending on whether is_training is true or not. – kawingkelvin Oct 17 '20 at 23:52
  • In reproducing this bug, I use a very, very small dataset; I wonder if batch norm could cause such a big deviation between the loss/metrics printed on the progress bar and the real ones for a small set. The metrics are especially more damning than the loss (I am aware the loss is mini-batch vs. entire batch), since I thought they were "accumulative" via update_state(...) calls. – kawingkelvin Oct 17 '20 at 23:57

2 Answers


You get different results because fit() displays the training loss as a running average over the batches of the current epoch, computed while the weights are still being updated batch by batch; since the model changes during the epoch, this running average can end up far from the final value. evaluate(), on the other hand, is computed with the model as it is at the end of training, resulting in a different loss. You can check the official Keras FAQ and the related StackOverflow post.
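
One way to see this (a minimal sketch, reusing the model and ds from the question) is to re-run evaluate() on the training data at the end of every epoch with a small callback and compare it with the running averages shown on the progress bar:

class EvalOnTrain(tf.keras.callbacks.Callback):
    # Re-evaluate with the end-of-epoch weights and print the result next to
    # the running averages that fit() reported for the same epoch.
    def __init__(self, train_ds):
        super().__init__()
        self.train_ds = train_ds

    def on_epoch_end(self, epoch, logs=None):
        loss, acc = self.model.evaluate(self.train_ds, verbose=0)
        print(f'epoch {epoch}: evaluate loss={loss:.4f} acc={acc:.4f} | '
              f'fit loss={logs["loss"]:.4f} '
              f'acc={logs["sparse_categorical_accuracy"]:.4f}')

model.fit(ds, epochs=5, callbacks=[EvalOnTrain(ds)])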

Also, try to increase the learning rate.

  • I think you may be partially right, but that probably doesn't fully explain the large difference I am observing. I am able to reproduce this on https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/vision/ipynb/image_classification_from_scratch.ipynb. If this is how it is, then the progress bar loss/metrics are highly unreliable as a guide – kawingkelvin Oct 17 '20 at 23:40
  • Also, to eliminate the issue of averaging over batches, I reproduced this with full-batch gradient descent, such that 1 epoch is achieved in 1 step. I still see a huge difference in the accuracy, like 1.0 vs. 0.3125. – kawingkelvin Oct 17 '20 at 23:48
  • Also, I verified that sparse categorical accuracy does "accumulative" averaging, not only over the current batch, such that at the very end the metric covers the entire dataset (1 epoch). I reimplemented my own "sparse cat accuracy" out of necessity due to a bug with TPU, and confirmed it matched exactly with tf.keras.metrics.SparseCategoricalAccuracy and with the expected behavior. I am fairly confident my original issue is now entirely due to the batch norm layers. – kawingkelvin Nov 01 '20 at 19:06

The big discrepancy seen in the metrics can be explained (or at least partially so) by the presence of batch norm in the model. Below are two cases: one that does not reproduce the issue and one that does once batch norm is introduced. In both cases, batch_size is equal to the full length of the data (i.e. full gradient descent, nothing 'stochastic'), to minimize confusion over mini-batch statistics.

Not reproducible:

  x = np.random.randn(1024, 1).astype('float32')
  y = np.random.randint(0, 3, (1024, 1)).astype('int32')
  ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(1024)

  model = Sequential()
  model.add(Dense(10, activation='relu'))
  model.add(Dense(10, activation='relu'))
  model.add(Dense(10, activation='relu'))
  model.add(Dense(3, activation='softmax'))

Reproducible:

  from tensorflow.keras.layers import BatchNormalization, Activation

  model = Sequential()
  model.add(Dense(10))
  model.add(BatchNormalization())
  model.add(Activation('relu'))
  model.add(Dense(10))
  model.add(BatchNormalization())
  model.add(Activation('relu'))
  model.add(Dense(10))
  model.add(BatchNormalization())
  model.add(Activation('relu'))

  model.add(Dense(3, activation='softmax'))

In fact, if you try model.predict(x) and model(x, training=True), you will see a large difference in y_pred. Also, per the Keras doc, the training-mode result depends on what else is in the batch. So the prediction model(x[0:1], training=True) for x[0] will differ from model(x[0:2], training=True), which includes an extra sample.
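
A minimal sketch of that check, assuming the reproducible model and the toy x defined above:

  # Inference mode uses the BN moving averages; training mode normalizes with
  # the statistics of the batch that is passed in.
  p_infer = model(x, training=False).numpy()
  p_train = model(x, training=True).numpy()
  print(np.abs(p_infer - p_train).max())

  # The training-mode output for a sample also depends on its batch companions:
  a = model(x[0:1], training=True).numpy()
  b = model(x[0:2], training=True).numpy()[0:1]
  print(np.abs(a - b).max())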

It's probably best to go to the Keras doc and the original paper for the details, but I do think you will have to live with this and interpret what you see on the progress bar accordingly. It looks rather fishy if you try to use the training loss/accuracy to see whether you have a bias (not variance) issue. When in doubt, we can just run evaluate() on the training set once the model has "converged" to a good minimum. I overlooked this detail altogether in my prior work because underfitting (bias) is rare for deep nets, so I went by the validation loss/metrics to determine when to stop training. But I would probably go back to the same model and evaluate on the training set, just to see whether the model has the capacity (i.e. no bias problem).
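
As a concrete sketch of that final check (assuming the compiled model and ds from the question):

  # Compare the last progress-bar numbers with a fresh evaluate() on the
  # same training data, using the end-of-training weights.
  history = model.fit(ds, epochs=5)
  train_loss, train_acc = model.evaluate(ds, verbose=0)
  print('progress bar:', history.history['loss'][-1],
        history.history['sparse_categorical_accuracy'][-1])
  print('evaluate()  :', train_loss, train_acc)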
