Max-pooling is useful in vision for two reasons:
By eliminating non-maximal values, it reduces computation for upper layers.
It provides a form of translation invariance. Imagine cascading a max-pooling layer with a convolutional layer. There are 8 directions in which one can translate the input image by a single pixel. If max-pooling is done over a 2x2 region, 3 out of these 8 possible configurations will produce exactly the same output at the convolutional layer. For max-pooling over a 3x3 window, this jumps to 5/8.
Since it provides additional robustness to position, max-pooling is a “smart” way of reducing the dimensionality of intermediate representations.
I can't understand, what does 8 directions
mean? And what does
"If max-pooling is done over a 2x2 region, 3 out of these 8 possible configurations will produce exactly the same output at the convolutional layer. For max-pooling over a 3x3 window, this jumps to 5/8."
mean?