I have observed that with a SpatialDropout2D(0.2) layer after each of 5 Conv2D layers, the training and validation error is much lower during the first few epochs than with the same network without these dropout layers (all else being equal). This seems counterintuitive, since I would expect the optimization routine to have more trouble finding a minimum when intermediate results are randomly dropped out.
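For concreteness, here is a minimal sketch of the kind of network I mean, assuming Keras; the filter counts, kernel size, input shape, and output layer are illustrative placeholders, not my exact architecture:

    from tensorflow import keras
    from tensorflow.keras import layers

    def build_model(use_spatial_dropout=True):
        model = keras.Sequential()
        model.add(keras.Input(shape=(64, 64, 3)))  # hypothetical input shape
        for filters in (32, 32, 64, 64, 128):      # 5 convolutional layers
            model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
            if use_spatial_dropout:
                # Drops entire feature maps with probability 0.2
                model.add(layers.SpatialDropout2D(0.2))
            model.add(layers.MaxPooling2D())
        model.add(layers.Flatten())
        model.add(layers.Dense(10, activation="softmax"))  # hypothetical output
        model.compile(optimizer="adam",
                      loss="categorical_crossentropy",
                      metrics=["accuracy"])
        return model

The comparison I describe is between build_model(True) and build_model(False), trained identically.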
So is my observation plausible? And if so, why?