I am getting different results when I run model.evaluate in TensorFlow more than once on the same validation set.
The model consists of data augmentation layers, an EfficientNetB0 baseline, and a GlobalAveragePooling2D layer (see below). I am loading the validation dataset with a tf.data pipeline built from tensor slices of a dataframe, and it is not shuffled, so the order is always the same.
from tensorflow.keras import Input, Model
from tensorflow.keras.applications import EfficientNetB0
from tensorflow.keras.layers import (Dense, GlobalAveragePooling2D, RandomCrop,
                                     RandomFlip, RandomRotation, RandomZoom)

def get_custom_model(input_shape, saved_model_path=None, training_base_model=True):
    input_layer = Input(shape=input_shape)

    # Data augmentation layers (called with training=False)
    data_augmentation = RandomFlip('horizontal')(input_layer, training=False)
    data_augmentation = RandomRotation(factor=(-0.2, 0.2))(data_augmentation, training=False)
    data_augmentation = RandomZoom(height_factor=(-0.2, 0.2))(data_augmentation, training=False)
    data_augmentation = RandomCrop(width=input_shape[0], height=input_shape[1])(data_augmentation, training=False)

    baseline_model = EfficientNetB0(include_top=False, weights='imagenet')
    baseline_model.trainable = training_base_model  # Added for bsg hypertuning
    baseline_output = baseline_model(data_augmentation, training=training_base_model)
    baseline_output = GlobalAveragePooling2D()(baseline_output)

    attributes_output = Dense(units=228, activation='sigmoid', name='attributes_output')(baseline_output)

    model = Model(inputs=[input_layer], outputs=[attributes_output])

    # Load previously saved weights
    if saved_model_path is not None:
        model.load_weights(saved_model_path)  # .expect_partial()

    return model
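For context, the validation dataset is built roughly like the sketch below (the dataframe column names and the decode helper are placeholders, not my exact pipeline); the important point is that there is no shuffling anywhere:

import tensorflow as tf

# val_df is the validation dataframe; 'image_path' and label_columns are placeholder names
def load_image(path, labels):
    image = tf.io.read_file(path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, [224, 224])
    return image, labels

val_ds = tf.data.Dataset.from_tensor_slices(
    (val_df['image_path'].values, val_df[label_columns].values.astype('float32')))
val_ds = val_ds.map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
val_ds = val_ds.batch(32).prefetch(tf.data.AUTOTUNE)  # no shuffle, so the order is fixed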
I am aware that if I trained the model again the results might differ, because some layers are initialized with random weights, but I expected evaluating the same model to give identical results. I am calling get_custom_model with the same saved_model_path every time, so the model always loads the same previously saved weights.
The metrics I am comparing, and which differ between runs, are the loss, Precision, and Recall, in case that is relevant. The optimizer is rmsprop and the loss is BinaryCrossentropy. I have also tried setting training_base_model to False, and the metrics become much poorer (almost as if the weights were random).
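For completeness, the compile and evaluate step looks roughly like the sketch below; the input shape, checkpoint path, and metric names are placeholders, not my exact values:

import tensorflow as tf

# Placeholder input shape and checkpoint path
model = get_custom_model(input_shape=(224, 224, 3),
                         saved_model_path='checkpoints/best_weights',
                         training_base_model=True)

model.compile(optimizer='rmsprop',
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[tf.keras.metrics.Precision(name='precision'),
                       tf.keras.metrics.Recall(name='recall')])

# Two consecutive evaluations on the same, unshuffled dataset give different numbers
print(model.evaluate(val_ds, return_dict=True))
print(model.evaluate(val_ds, return_dict=True))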
PS: During training I was also using this same validation set to compute the validation metrics and to save the best weights based on them, but when I load those best weights again the results are not the same. For instance, I can get a Precision of 81.28% on validation during a training epoch and then 57% when loading those weights and calling model.evaluate().
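The best weights were saved during training with a checkpoint callback roughly like this (the monitored metric name, the file path, and the epoch count are placeholders):

checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath='checkpoints/best_weights',  # placeholder path, same one passed to get_custom_model
    monitor='val_precision',              # placeholder name of the tracked validation metric
    mode='max',
    save_best_only=True,
    save_weights_only=True)

model.fit(train_ds,                  # train_ds is the training pipeline (not shown)
          validation_data=val_ds,    # the same validation set used later in model.evaluate()
          epochs=20,
          callbacks=[checkpoint_cb])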