How does Keras' model.evaluate() work? In particular, how does the batch_size argument affect the calculation?
The documentation says the loss/metrics are computed as averages over the batches. However, since there is only one scalar output per loss/metric (which should represent the overall average across all data, i.e. the average of the batch averages), the result should not depend on the choice of batch_size (at least as long as the total number of samples is divisible by the batch_size).
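To make that expectation concrete, here is a plain NumPy sketch (my own illustration with made-up per-sample losses, not Keras internals): when every batch has the same size, the mean of the per-batch mean losses equals the overall per-sample mean, regardless of the batch size.

```python
import numpy as np

# Hypothetical per-sample losses for 60 samples (random stand-ins)
rng = np.random.default_rng(0)
per_sample = rng.random(60)

overall = per_sample.mean()

for batch_size in (1, 3, 10, 60):
    # Split into equal-size batches and take the mean of the batch means
    batch_means = per_sample.reshape(-1, batch_size).mean(axis=1)
    assert np.isclose(batch_means.mean(), overall)
```

So arithmetically, with 60 samples and any of these batch sizes, all four evaluations should yield the same number.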
But after building and training a network (consisting of Conv2D, Conv2DTranspose, MaxPooling2D, and BatchNormalization, using ReLU as activations), I tried evaluating it on my test set of 60 samples:
- Evaluation with batch_size = 60 gave loss 0.1375531554222107
- Evaluation with batch_size = 10 gave loss 0.1381820539633433
- Evaluation with batch_size = 3 gave loss 0.14014312624931335
- Evaluation with batch_size = 1 gave loss 0.15437299211819966
The entire dataset (60 samples) is divisible by every batch_size used here (1, 3, 10, 60), yet the outputs vary considerably. Batch normalization could be a suspect, but I don't think it is the cause: running the evaluation multiple times always yields exactly the same numbers.
Even if there were no shuffling before the batches are formed, and the numbers were therefore expected to be deterministic, that still wouldn't explain why I can't reproduce the last number (the evaluation with batch_size=1) by averaging individual per-sample evaluations. That is, why do the following two snippets produce very different results:
model.evaluate(testset, testlabels, batch_size=1)
and:
losses = [model.evaluate(testset[i:i+1], testlabels[i:i+1], batch_size=1) for i in range(60)]
np.mean(losses)

(Note the i:i+1 slicing, which keeps the batch dimension intact when passing a single sample.)
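For reference, my mental model of how the single scalar could be accumulated is a batch-size-weighted running average of the per-batch means. A minimal NumPy sketch of that aggregation (my own simulation with made-up losses, not actual Keras code):

```python
import numpy as np

# Hypothetical per-sample losses (random stand-ins for real model losses)
rng = np.random.default_rng(1)
per_sample = rng.random(60)

def aggregate(losses, batch_size):
    """Running average of batch means, weighted by batch size."""
    total, seen = 0.0, 0
    for start in range(0, len(losses), batch_size):
        batch = losses[start:start + batch_size]
        total += batch.mean() * len(batch)  # weight each batch by its size
        seen += len(batch)
    return total / seen

# Weighted this way, the result equals the plain per-sample mean for
# every batch size, even when the last batch is smaller
for bs in (1, 3, 7, 10, 60):
    assert np.isclose(aggregate(per_sample, bs), per_sample.mean())
```

If Keras aggregates like this, the loss should be identical for all batch sizes, which is exactly why the differing numbers above confuse me.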
Here, assume of course that model.evaluate returns only one scalar, namely the final loss, with no metrics or intermediate losses. In summary: how does model.evaluate work internally, how are the results for different batch_size's connected, and what is their relationship to a simple average of per-sample losses?
This similar question doesn't help (it asks about the optimal choice of batch size, while I just want to know how the evaluation works).