
I am getting different results each time I run model.evaluate in TensorFlow on the same validation set.

The model consists of data augmentation layers, an EfficientNetB0 base, and a GlobalAveragePooling2D layer (see the function below). I load the validation dataset with a tf.data pipeline built with from_tensor_slices from a dataframe, and it is not shuffled, so the order is always the same.
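A simplified sketch of that pipeline (the dataframe column names, image size, and batch size here are placeholders, not my exact code):

import tensorflow as tf

IMG_SIZE = (224, 224)  # placeholder input size

def load_image(path, labels):
    # Decode and resize one image; EfficientNetB0 expects raw 0-255 pixels.
    image = tf.io.read_file(path)
    image = tf.io.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, IMG_SIZE)
    return image, labels

def make_val_dataset(df, label_columns, batch_size=32):
    paths = df['image_path'].values
    labels = df[label_columns].values.astype('float32')
    ds = tf.data.Dataset.from_tensor_slices((paths, labels))
    ds = ds.map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
    # No .shuffle() here, so the order is deterministic.
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)

The model itself: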

# Imports assumed by this function (not shown in my original snippet):
from tensorflow.keras import Input, Model
from tensorflow.keras.applications import EfficientNetB0
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.layers.experimental.preprocessing import (
    RandomCrop, RandomFlip, RandomRotation, RandomZoom)

def get_custom_model(input_shape, saved_model_path=None, training_base_model=True):
    input_layer = Input(shape=input_shape)

    # Augmentation layers are called with training=False so they act as
    # the identity (no randomness) at inference time.
    data_augmentation = RandomFlip('horizontal')(input_layer, training=False)
    data_augmentation = RandomRotation(factor=(-0.2, 0.2))(data_augmentation, training=False)
    data_augmentation = RandomZoom(height_factor=(-0.2, 0.2))(data_augmentation, training=False)
    data_augmentation = RandomCrop(width=input_shape[0], height=input_shape[1])(data_augmentation, training=False)

    baseline_model = EfficientNetB0(include_top=False, weights='imagenet')
    baseline_model.trainable = training_base_model  # Added for bsg hypertuning

    baseline_output = baseline_model(data_augmentation, training=training_base_model)
    baseline_output = GlobalAveragePooling2D()(baseline_output)
    attributes_output = Dense(units=228, activation='sigmoid', name='attributes_output')(baseline_output)

    model = Model(inputs=[input_layer], outputs=[attributes_output])

    # Load previously saved weights, if a path is given
    if saved_model_path is not None:
        model.load_weights(saved_model_path)  # .expect_partial()

    return model

I am aware that if I trained the model again the results might indeed differ, because some layers are initialized with random weights, but I expected evaluating the same model to give identical results. I am calling get_custom_model with the same saved_model_path every time, so the model always loads the same previously saved weights.
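The repeated evaluation looks roughly like this (the input shape, weights path, and val_ds are placeholders; both models are compiled with identical arguments, shown after the next paragraph):

# Sketch: load the same saved weights twice and evaluate on the same
# (unshuffled) validation set.
model_1 = get_custom_model((224, 224, 3), saved_model_path='best_weights.h5')
model_2 = get_custom_model((224, 224, 3), saved_model_path='best_weights.h5')
# ... both models compiled with the same arguments (see the compile call below) ...
print(model_1.evaluate(val_ds))
print(model_2.evaluate(val_ds))  # I expected the same numbers, but they differ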

The metrics I am comparing, and that differ between runs, are the loss, Precision, and Recall, in case that is relevant. The optimizer is rmsprop and the loss is BinaryCrossentropy. Also, when I set training_base_model to False the metrics are much poorer (almost as if the weights were random).
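The compile call is essentially this (a sketch; the metric names are my own choice):

import tensorflow as tf

model.compile(
    optimizer='rmsprop',
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.Precision(name='precision'),
             tf.keras.metrics.Recall(name='recall')],
)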

PS: During training I also used this same validation set to compute the validation metrics and to save the best weights based on them, but when I load those best weights again the results do not match. For instance, I can get a Precision of 81.28% on the validation set during a training epoch and then 57% when loading those same weights and calling model.evaluate().
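The checkpointing during training was roughly this (a sketch; the file path, monitored metric name, and epoch count are placeholders):

# Save the weights of the epoch with the best validation precision.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath='best_weights.h5',
    monitor='val_precision',
    mode='max',
    save_best_only=True,
    save_weights_only=True,
)
model.fit(train_ds, validation_data=val_ds, epochs=20, callbacks=[checkpoint])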

  • Your data augmentation functions all have the word "random" in their name, so your model may run on different data every time, which might explain the different results. – Arne May 07 '21 at 09:22
  • Thanks for the comment, but TensorFlow's random preprocessing layers are by default only applied during training. I am passing training=False so that they act as the identity at inference time (see the quick check after these comments). (source: https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/RandomCrop) – Angel Luis May 07 '21 at 09:25
  • You have to include the evaluation code, including the multiple calls to evaluate and the results it produces. – Dr. Snoopy May 07 '21 at 14:43
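A minimal check of the behavior Angel Luis describes (a sketch, not code from the thread): with training=False, a random preprocessing layer should return its input unchanged on every call.

import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import RandomFlip

# With training=False the layer should act as the identity every time.
layer = RandomFlip('horizontal')
x = tf.random.uniform((1, 8, 8, 3))
out_1 = layer(x, training=False)
out_2 = layer(x, training=False)
print(bool(tf.reduce_all(out_1 == out_2)))  # expected: True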

0 Answers