
I've just used the ModelCheckpoint callback for the first time to save the best model (save_best_only=True) and wanted to test its performance. When the model was saved, the callback reported a val_acc of 83.3%. I loaded the saved model and ran evaluate_generator on validation_generator, but the resulting val_acc was 0.639. Confused, I ran it again and got 0.654, then 0.647, 0.744 and so on. I've tested the same configuration on my PC (no GPUs) and there it consistently gives the same results (apart from occasional small rounding differences).

[Screenshot: val_acc result reported upon saving]

  1. Why do the results differ between evaluate_generator runs, and only on the GPU?
  2. Why is the model's val_acc different from the one reported when the checkpoint was saved?

I am using TensorFlow's implementation of Keras.

model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.SGD(lr=1e-4, momentum=0.9),
              metrics=['accuracy'])
checkpointer = ModelCheckpoint(filepath='/tmp/weights.hdf5', monitor = "val_acc", verbose=1, save_best_only=True)
# prepare data augmentation configuration
train_datagen = ImageDataGenerator(
    rescale = 1./ 255,
    shear_range = 0.2,
    zoom_range = 0.2,
    horizontal_flip = True)
test_datagen = ImageDataGenerator(rescale=1. / 255)
train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size = (img_height, img_width),
    batch_size = batch_size)
validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size = (img_height, img_width),
    batch_size = batch_size)
# fine-tune the model
model.fit_generator(
    train_generator,
    steps_per_epoch = math.ceil(train_samples/batch_size),
    epochs=100,
    workers = 120,
    validation_data=validation_generator,
    validation_steps=math.ceil(val_samples/batch_size),
    callbacks=[checkpointer])
model.load_weights(filepath='/tmp/weights.hdf5')
model.predict_generator(validation_generator, steps = math.ceil(val_samples/batch_size) )
temp_model = load_model('/tmp/weights.hdf5')
temp_model.evaluate_generator(validation_generator, steps = math.ceil(val_samples/batch_size), workers = 120)
>>> [2.1996076788221086, 0.17857142857142858]
temp_model.evaluate_generator(validation_generator, steps = math.ceil(val_samples/batch_size), workers = 120)
>>> [2.2661823204585483, 0.25]
Centar 15
  • Doeas your generator return the same validation set every time? Is there any randomness in how it selects the samples? – gionni Sep 01 '17 at 09:28
  • As shown in the code, I use the built-in flow_from_directory. The train and test folders are separate. I think it uses everything => no randomness. train_data_dir = '/data/datasets/NAME/train' validation_data_dir = '/data/datasets/NAME/test' – Centar 15 Sep 01 '17 at 09:35
  • Is `val_samples` equal to the total number of image files under `validation_data_dir`? – Yu-Yang Sep 01 '17 at 12:28
  • That is actually a mistake I made, but the errors remain even with the train_generator where everything is OK. – Centar 15 Sep 01 '17 at 12:43
  • After the corrections, the errors remain. – Centar 15 Sep 01 '17 at 13:29
  • What if you call `validation_generator.reset()` after each of the `fit_generator()/predict_generator()/evaluate_generator()` function calls? – Yu-Yang Sep 01 '17 at 16:18
  • Sadly, no change. A small edit: I am using TensorFlow's implementation of Keras. – Centar 15 Sep 04 '17 at 07:16
  • It's very hard to debug this kind of thing just by inspecting the code. However, the CPU vs GPU behavior difference is surprising --- there should be no difference. If you can make a small, reproducible, self-contained example that acts differently between CPU and GPU on an up-to-date TensorFlow, it would probably be worth filing a TensorFlow github issue. – Peter Hawkins Sep 07 '17 at 13:11

2 Answers


It is because you only save the model weights. This means you are not saving the optimizer state, which explains the difference in accuracy when you reload the model. If you add save_weights_only=False when you create the ModelCheckpoint, the issue will be resolved.

When you reload the model, use Keras's load_model function; otherwise you will still load only the weights:

checkpointer = ModelCheckpoint(filepath='/tmp/full_model.hdf5', monitor = "val_acc", verbose=1, save_best_only=True, save_weights_only=False)

#reload model
from keras.models import load_model
model = load_model('/tmp/full_model.hdf5')
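
For context, a minimal sketch (file paths are placeholders) contrasting the two save/load paths: model.save() stores the architecture, weights and optimizer state (e.g. the SGD momentum buffers), while save_weights() stores the weights only and needs an already built and compiled model on the loading side.

from keras.models import load_model

# Full save: architecture + weights + optimizer state,
# so training can later resume exactly where it stopped.
model.save('/tmp/full_model.hdf5')
restored = load_model('/tmp/full_model.hdf5')

# Weights-only save: the receiving model must already be built and compiled.
model.save_weights('/tmp/weights_only.hdf5')
model.load_weights('/tmp/weights_only.hdf5')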
Wilmar van Ommeren
  • Thanks for the answer. I now used the same model variable and imported only the weights with load_weights. The results are still different between evaluations, and massively different from the value reported at save time. – Centar 15 Sep 01 '17 at 10:42
  • Don't import only the weights; you also need to restore the optimizer state. Check my updated answer. – Wilmar van Ommeren Sep 01 '17 at 10:46
  • Thanks! Would it be too much of a burden to explain why the optimizer state matters? Isn't it just a way to... well... optimize? I'll try that now and hope to get the same result. The accuracy oscillates quite wildly due to the relatively small dataset: it reaches 0.791 and then drops to 0.667, so I must reduce the learning rate further. – Centar 15 Sep 01 '17 at 10:52
  • Sadly, that did not work. >>> checkpointer = ModelCheckpoint(filepath='/tmp/weights4.hdf5', monitor = "val_acc", verbose=1, ... save_best_only=True, save_weights_only=False) >>> model.load_weights("/tmp/weights4.hdf5") evaluate1 => 0.75, evaluate2 => 0.625 – Centar 15 Sep 01 '17 at 11:07
  • That shouldn't be the case. Did your model continue to train after it saved the best model? For example, the best model is at the 90th epoch but your model kept training until the 100th epoch. In that case evaluate1 gives the model accuracy at the 100th epoch and evaluate2 at the 90th epoch. – Wilmar van Ommeren Sep 01 '17 at 11:43
  • Optimizers are updated during training to minimize the loss (https://medium.com/towards-data-science/types-of-optimization-algorithms-used-in-neural-networks-and-ways-to-optimize-gradient-95ae5d39529f). – Wilmar van Ommeren Sep 01 '17 at 11:48
  • No training happens between the two evaluations. evaluate1 and evaluate2 are just me calling evaluate_generator twice in a row. I also tried predicting and got different results between two predictions in a row (I checked; even allowing for a random ordering of samples I cannot find matches) and found this gem: https://stackoverflow.com/questions/43938176/why-differ-metrics-calculated-by-model-evaluate-from-tracked-metrics-during-tr I really would not want to write my own custom generator with augmentations... – Centar 15 Sep 01 '17 at 11:48
  • I understand that no training happens, but it could be that the model you use in evaluate1 is not the same model that was saved, because the last epoch did not result in the 'best' model. – Wilmar van Ommeren Sep 01 '17 at 11:51
  • You can check this by using `model.save` after your `evaluate1`. This will definitely save the same model (which might not be the best one) – Wilmar van Ommeren Sep 01 '17 at 11:52
  • I understand that, I even tried your approach and still the results are different. model.evaluate_generator(validation_generator, steps = math.ceil(val_samples/batch_size), workers = 120) [1.4041728178660076, 0.66666666666666663] >>> model.evaluate_generator(validation_generator, steps = math.ceil(val_samples/batch_size), workers = 120) [1.1987533519665401, 0.70833333333333337] >>> model.save("/tmp/saved.hdf5") >>> temp_model = model.load("/tmp/saved.hdf5") >>> temp_model = load_model("/tmp/saved.hdf5") >>> temp_model.evaluate_generator(...) [0.81864879528681433, 0.8125] – Centar 15 Sep 01 '17 at 12:16
  • Only thing I can think off is that the validation_generator generates different input data. Can you set `numpy.random.seed(42)` before every call of `model.evaluate_generator`? – Wilmar van Ommeren Sep 01 '17 at 13:33
  • You can also find more on this issue here: https://github.com/fchollet/keras/issues/6499 – Wilmar van Ommeren Sep 01 '17 at 13:38
  • Too bad; then it has something to do with the generator. In the article I posted, the last commenter states that he uses `max_queue_size=1` to get the correct results (see the sketch below this thread). That is my last suggestion; otherwise, sadly, I cannot help you further (except for the optimizer part, which is still important if you save a model). – Wilmar van Ommeren Sep 01 '17 at 14:35
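
Following up on the comments above, a sketch (untested; variable names taken from the question) of an evaluation call that removes the sources of randomness discussed here: a non-shuffling validation generator, a reset before evaluating, max_queue_size=1 and a single worker.

validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    shuffle=False)  # visit the files in the same order on every pass

validation_generator.reset()
temp_model.evaluate_generator(
    validation_generator,
    steps=math.ceil(val_samples / batch_size),
    max_queue_size=1,  # as suggested in the linked thread
    workers=1)         # single worker avoids multi-worker scheduling effects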

OK, the problem was the following: batch_size! It took me ages to figure this one out. The culprit was

steps = math.ceil(val_samples/batch_size)

Because batch_size was not a divisor of the number of samples, this created problems. Some small errors also came from setting the workers argument; when running on a GPU there is no point in using it.
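
For reference, a minimal sketch of a setup where the generator covers each validation sample exactly once (assumptions: the generator exposes a .samples attribute as in Keras 2's flow_from_directory, and the variable names follow the question's code):

# Number of files flow_from_directory found; steps * batch_size should equal this exactly.
val_samples = validation_generator.samples

eval_batch_size = 1  # any batch size that divides val_samples works; 1 is the safe fallback
eval_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=eval_batch_size,
    shuffle=False)

model.evaluate_generator(eval_generator, steps=val_samples // eval_batch_size)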

Centar 15