In both your examples, assume we have a `[height, width]` kernel applied with strides `[2, 2]`. That means we apply the kernel to a 2-D window of size `[height, width]` on the 2-D input to get an output value, and then slide the window over by 2 (to the right or down) to get the next output value. In both cases you end up with 4x fewer outputs than inputs (2x fewer in each dimension), assuming `padding='SAME'`.
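As a quick sanity check on that 4x claim, with `padding='SAME'` the output size in each dimension is the input size divided by the stride, rounded up. A minimal sketch (the helper name `same_pad_out` is my own, not a TensorFlow API):

```python
import math

def same_pad_out(size, stride):
    # With padding='SAME', output size = ceil(input size / stride)
    return math.ceil(size / stride)

# A 6x6 input with strides [2, 2] yields a 3x3 output: 4x fewer values overall.
print(same_pad_out(6, 2), same_pad_out(6, 2))  # → 3 3
```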
The difference is how the output values are computed for each window:
`conv2d`

- the output is a linear combination of the input values, with a weight for each cell in the `[height, width]` kernel
- these weights become trainable parameters in your model

`max_pool`

- the output is simply the maximum input value within the `[height, width]` window of input values
- there are no weights, so this operation introduces no trainable parameters
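The difference can be made concrete with a plain-numpy sketch of both operations on a single 2-D input. This is an illustration, not the actual TensorFlow implementation; for simplicity it uses VALID-style windows (no padding), and the `windows` helper is hypothetical:

```python
import numpy as np

def windows(x, kh, kw, stride):
    """Yield each kh x kw sliding window of x (VALID padding, no overflow)."""
    H, W = x.shape
    for i in range(0, H - kh + 1, stride):
        for j in range(0, W - kw + 1, stride):
            yield x[i:i + kh, j:j + kw]

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4))
kernel = rng.standard_normal((2, 2))  # in a real model these are trainable weights

# conv2d-style: each output is the weighted sum of a window with the kernel
conv_out = np.array([np.sum(w * kernel) for w in windows(x, 2, 2, 2)]).reshape(2, 2)

# max_pool-style: each output is just the window maximum; no weights involved
pool_out = np.array([w.max() for w in windows(x, 2, 2, 2)]).reshape(2, 2)

print(conv_out.shape, pool_out.shape)  # → (2, 2) (2, 2)
```

Both outputs have the same shape for the same window size and stride; only the per-window computation differs.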