
Why I'm confused:

If I test my model on examples [A, B, C], it will obtain a certain accuracy. If I test the same model on examples [C, B, A], it should obtain the same accuracy. In other words, shuffling the examples shouldn't change my model's accuracy. But that's not what seems to be happening below:

Step-by-step:

Here is where I train the model:

model.fit_generator(batches, batches.nb_sample, nb_epoch=1, verbose=2,
                    validation_data=val_batches,
                    nb_val_samples=val_batches.nb_sample)

Here is where I test the model, without shuffling the validation set:

gen = ImageDataGenerator()
results = []
for _ in range(3):
    val_batches = gen.flow_from_directory(path+"valid", batch_size=batch_size*2,
                                          target_size=target_size, shuffle=False)
    result = model.evaluate_generator(val_batches, val_batches.nb_sample)
    results.append(result)

Here are the results (val_loss, val_acc):

[2.8174608421325682, 0.17300000002980231]
[2.8174608421325682, 0.17300000002980231]
[2.8174608421325682, 0.17300000002980231]

Notice that the validation accuracies are the same.

Here is where I test the model, with a shuffled validation set:

results = []
for _ in range(3):
    val_batches = gen.flow_from_directory(path+"valid", batch_size=batch_size*2,
                                          target_size=target_size, shuffle=True)
    result = model.evaluate_generator(val_batches, val_batches.nb_sample)
    results.append(result)

Here are the results (val_loss, val_acc):

[2.8174608802795409, 0.17299999999999999]
[2.8174608554840086, 0.1730000001192093]
[2.8174608268737793, 0.17300000059604645]

Notice that the validation accuracies are inconsistent, despite an unchanged validation set and an unchanged model. What's going on?


Note:

I'm evaluating on the entire validation set each time. model.evaluate_generator returns after evaluating the model on the number of examples equal to val_batches.nb_sample, which is the number of examples in the validation set.
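To make the order-invariance claim concrete, here is a minimal sketch (with hypothetical predictions and labels standing in for the real validation set, not my actual data) showing that accuracy computed exactly over a full set is unchanged by shuffling:

```python
import random

# Hypothetical predictions and ground-truth labels for a 10-example set.
preds  = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
labels = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

def accuracy(pairs):
    # Exact integer arithmetic: number correct / total.
    correct = sum(p == y for p, y in pairs)
    return correct / len(pairs)

pairs = list(zip(preds, labels))
baseline = accuracy(pairs)  # 7 correct out of 10 -> 0.7

for seed in range(3):
    random.Random(seed).shuffle(pairs)
    assert accuracy(pairs) == baseline  # order never matters here
```

In exact arithmetic, then, shuffling cannot change the result, which is why the fluctuations above are so puzzling.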

  • Are you *sure* the validation set is unchanged each time? If you're shuffling and sub-sampling the data each time, that set may be different on each iteration through the testing loop. – bnaecker Jan 24 '17 at 23:07
  • I don't see a problem, you told keras to shuffle the dataset, and that will slightly change the final solutions. – Dr. Snoopy Jan 25 '17 at 00:33
  • I'm not subsampling the data. `evaluate_generator` returns after evaluating the model on `val_batches.nb_sample` examples, which is the total number of examples in the validation set. This is too implicit, though. I'll make it more explicit in the walkthrough. Thank you. – Matt Kleinsmith Jan 25 '17 at 01:30

1 Answer


This is a really interesting problem. The answer is that neural networks use the float32 format, which is not as precise as float64; fluctuations like these are simply rounding error at the limits of float32 precision.

In the case of your loss, you may notice that the differences occur around the 7th significant decimal digit, which is exactly the precision of the float32 format. So, basically, you may assume that all the numbers in your example are equal in terms of their float32 representation.
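As an illustration of this point (a sketch of the general float32 effect, not of Keras internals): floating-point addition is not associative, so accumulating the same values in a different order, as a shuffled generator does when losses are averaged over batches, can change the result in the last digits.

```python
import numpy as np

# Floating-point addition is not associative. At float32, adding 1 to
# -1e8 is lost entirely, so grouping changes the answer:
a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1)
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0 -- the 1 vanishes when added to -1e8 first

# The same effect at a much smaller scale: summing hypothetical
# per-example losses (random stand-ins, not the model's real losses)
# in two different orders gives results that agree only to roughly
# float32's ~7 significant digits.
losses = np.random.RandomState(0).rand(1000).astype(np.float32)
s1 = np.float32(0)
for v in losses:        # one accumulation order
    s1 += v
s2 = np.float32(0)
for v in losses[::-1]:  # same values, reversed order
    s2 += v
print(s1, s2)           # may differ in the trailing digits
```

This matches the pattern in the question: with `shuffle=True` the batch order (and hence the accumulation order) changes each run, while `shuffle=False` fixes the order and reproduces bit-identical results.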

  • Assuming this is an underflow problem, why does setting `shuffle=True` cause the inconsistency while setting `shuffle=False` doesn't? Anyway, I think you're right. From Wikipedia: "no more than 9 significant decimal digits can be stored" and I see that for the accuracies, the inconsistencies start at the 10th decimal digit. I'm a little confused about the loss values (first column). Why do the inconsistencies start at the 8th decimal digit there but start at the 10th decimal digit for the accuracies (second column)? – Matt Kleinsmith Jan 25 '17 at 01:18
  • Significant decimal digits are for the integer part of your number. For fractional part your precision is ~7.25 digits (it's in the same article) – Marcin Możejko Jan 25 '17 at 14:03