I am getting different results when I run model.evaluate in TensorFlow more than once on the same validation set.
The model consists of data augmentation layers, an EfficientNetB0 baseline, and a GlobalAveragePooling2D layer (see below). I am loading the validation dataset with a tf.data pipeline built from tensor slices of a dataframe, and it is not shuffled, so the order is always the same.
from tensorflow.keras import Input, Model
from tensorflow.keras.applications import EfficientNetB0
from tensorflow.keras.layers import (Dense, GlobalAveragePooling2D, RandomCrop,
                                     RandomFlip, RandomRotation, RandomZoom)

def get_custom_model(input_shape, saved_model_path=None, training_base_model=True):
    input_layer = Input(shape=input_shape)

    # Data augmentation layers (called with training=False)
    data_augmentation = RandomFlip('horizontal')(input_layer, training=False)
    data_augmentation = RandomRotation(factor=(-0.2, 0.2))(data_augmentation, training=False)
    data_augmentation = RandomZoom(height_factor=(-0.2, 0.2))(data_augmentation, training=False)
    data_augmentation = RandomCrop(width=input_shape[0], height=input_shape[1])(data_augmentation, training=False)

    baseline_model = EfficientNetB0(include_top=False, weights='imagenet')
    baseline_model.trainable = training_base_model  # Added for bsg hypertuning
    baseline_output = baseline_model(data_augmentation, training=training_base_model)
    baseline_output = GlobalAveragePooling2D()(baseline_output)

    attributes_output = Dense(units=228, activation='sigmoid', name='attributes_output')(baseline_output)

    model = Model(inputs=[input_layer], outputs=[attributes_output])

    # Load previously saved weights
    if saved_model_path is not None:
        model.load_weights(saved_model_path)  # .expect_partial()

    return model
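For context, the validation dataset is built roughly like the sketch below (the dataframe column names and the decode helper are placeholders, not my exact pipeline); the important point is that there is no shuffling anywhere:

import tensorflow as tf

# val_df is the validation dataframe; 'image_path' and label_columns are placeholder names
def load_image(path, labels):
    image = tf.io.read_file(path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, [224, 224])
    return image, labels

val_ds = tf.data.Dataset.from_tensor_slices(
    (val_df['image_path'].values, val_df[label_columns].values.astype('float32')))
val_ds = val_ds.map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
val_ds = val_ds.batch(32).prefetch(tf.data.AUTOTUNE)  # no shuffle, so the order is fixed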
I am aware that if I trained the model again the results might differ, because some layers are initialized with random weights, but I expected evaluating the same model to give identical results. I am calling get_custom_model with the same saved_model_path every time, so the model always loads the same previously saved weights.
The metrics I am comparing, and which differ between runs, are the loss, Precision, and Recall, in case that is relevant. The optimizer is rmsprop and the loss is BinaryCrossentropy. I have also tried setting training_base_model to False, and the metrics become much poorer (almost as if the weights were random).
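For completeness, the compile and evaluate step looks roughly like the sketch below; the input shape, checkpoint path, and metric names are placeholders, not my exact values:

import tensorflow as tf

# Placeholder input shape and checkpoint path
model = get_custom_model(input_shape=(224, 224, 3),
                         saved_model_path='checkpoints/best_weights',
                         training_base_model=True)

model.compile(optimizer='rmsprop',
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[tf.keras.metrics.Precision(name='precision'),
                       tf.keras.metrics.Recall(name='recall')])

# Two consecutive evaluations on the same, unshuffled dataset give different numbers
print(model.evaluate(val_ds, return_dict=True))
print(model.evaluate(val_ds, return_dict=True))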
PS: During training I was also using this same validation set to compute the validation metrics and to save the best weights based on them, but when I load those best weights again the results are not the same. For instance, I can get a Precision of 81.28% on validation during a training epoch and then 57% when loading those weights and calling model.evaluate().
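The best weights were saved during training with a checkpoint callback roughly like this (the monitored metric name, the file path, and the epoch count are placeholders):

checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath='checkpoints/best_weights',  # placeholder path, same one passed to get_custom_model
    monitor='val_precision',              # placeholder name of the tracked validation metric
    mode='max',
    save_best_only=True,
    save_weights_only=True)

model.fit(train_ds,                  # train_ds is the training pipeline (not shown)
          validation_data=val_ds,    # the same validation set used later in model.evaluate()
          epochs=20,
          callbacks=[checkpoint_cb])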