How does Keras' model.evaluate() work? In particular, how does the batch_size argument affect the calculation?
The documentation says the loss/metrics are computed as averages over the batches. However, since there is only one scalar output per loss/metric (which should represent the overall average across all data, i.e. the average of the batch averages), the result should not depend on the choice of batch_size (at least as long as the total number of samples is divisible by the batch_size).
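To make that expectation concrete, here is a plain NumPy sketch (my own illustration with made-up per-sample losses, not Keras internals): when every batch has the same size, the mean of the per-batch mean losses equals the overall per-sample mean, regardless of the batch size.

```python
import numpy as np

# Hypothetical per-sample losses for 60 samples (random stand-ins)
rng = np.random.default_rng(0)
per_sample = rng.random(60)

overall = per_sample.mean()

for batch_size in (1, 3, 10, 60):
    # Split into equal-size batches and take the mean of the batch means
    batch_means = per_sample.reshape(-1, batch_size).mean(axis=1)
    assert np.isclose(batch_means.mean(), overall)
```

So arithmetically, with 60 samples and any of these batch sizes, all four evaluations should yield the same number.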
But after building and training a network (consisting of Conv2D, Conv2DTranspose, MaxPooling2D, and BatchNormalization, using ReLU as activations), I tried evaluating it on my test set of 60 samples:
- Evaluation with batch_size = 60 gave loss 0.1375531554222107
- Evaluation with batch_size = 10 gave loss 0.1381820539633433
- Evaluation with batch_size = 3 gave loss 0.14014312624931335
- Evaluation with batch_size = 1 gave loss 0.15437299211819966
The entire dataset (60 samples) is divisible by every batch_size used here (1, 3, 10, 60), yet the outputs vary considerably. Batch normalization could be a suspect, but I don't think it is the cause: running the evaluation multiple times always yields exactly the same numbers.
Even if there were no shuffling before the batches are formed, and the numbers were therefore expected to be deterministic, that still wouldn't explain why I can't reproduce the last number (the evaluation with batch_size=1) by averaging individual per-sample evaluations. That is, why do the following two snippets produce very different results:
model.evaluate(testset, testlabels, batch_size=1)
and:
losses = [model.evaluate(testset[i:i+1], testlabels[i:i+1], batch_size=1) for i in range(60)]
np.mean(losses)

(Note the i:i+1 slicing, which keeps the batch dimension intact when passing a single sample.)
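For reference, my mental model of how the single scalar could be accumulated is a batch-size-weighted running average of the per-batch means. A minimal NumPy sketch of that aggregation (my own simulation with made-up losses, not actual Keras code):

```python
import numpy as np

# Hypothetical per-sample losses (random stand-ins for real model losses)
rng = np.random.default_rng(1)
per_sample = rng.random(60)

def aggregate(losses, batch_size):
    """Running average of batch means, weighted by batch size."""
    total, seen = 0.0, 0
    for start in range(0, len(losses), batch_size):
        batch = losses[start:start + batch_size]
        total += batch.mean() * len(batch)  # weight each batch by its size
        seen += len(batch)
    return total / seen

# Weighted this way, the result equals the plain per-sample mean for
# every batch size, even when the last batch is smaller
for bs in (1, 3, 7, 10, 60):
    assert np.isclose(aggregate(per_sample, bs), per_sample.mean())
```

If Keras aggregates like this, the loss should be identical for all batch sizes, which is exactly why the differing numbers above confuse me.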
Here, assume of course that model.evaluate returns only one scalar, namely the final loss, with no metrics or intermediate losses. In summary: how does model.evaluate work internally, how are the results for different batch_size's connected, and what is their relationship to a simple average of per-sample losses?
This similar question doesn't help (it asks about the optimal choice of batch size, while I just want to know how the evaluation works).