
When it comes to convolutional neural networks, there are normally many papers recommending different strategies. I have heard people say that it is an absolute must to add padding to the images before a convolution, otherwise too much spatial information is lost. On the other hand, they are happy to use pooling, normally max-pooling, to reduce the size of the images. I guess the thought here is that max-pooling reduces the spatial information but also reduces the sensitivity to relative positions, so it is a trade-off?

I have heard other people say that zero-padding does not keep more information, just more empty data, because the added zeros will not produce a reaction from your kernel anyway when part of the information is missing.

I can imagine that zero-padding helps if you have big kernels with "scrap values" at the edges and the source of activation centered in a smaller region of the kernel?

I would be happy to read some papers about the effect of down-sampling using pooling versus not using padding, but I can't find much about it. Any good recommendations or thoughts?

Figure: Spatial down-sampling using convolution versus pooling (ResearchGate)

Daniel Falk
    I'm voting to close this question as off-topic because it's a machine learning theory (not programming) question, and therefore probably belongs on http://datascience.stackexchange.com or http://stats.stackexchange.com. – mtrw Aug 19 '16 at 11:45
  • Have a look at this answer [convolutional-layers-to-pad-or-not-to-pad](https://stats.stackexchange.com/questions/246512/convolutional-layers-to-pad-or-not-to-pad). That gives pretty convincing explanation. – kmario23 Nov 07 '18 at 14:12

2 Answers


Adding padding is NOT an "absolute must". Sometimes it is useful for controlling the size of the output so that it is not reduced by the convolution (padding can even enlarge the output, depending on the padding and kernel sizes). The only information zero-padding adds is the border condition: it marks which pixels lie at (or near, depending on kernel size) the edge of the input. (You can think of it as a "passe-partout" in a picture frame.)
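To make the size effect concrete, here is a minimal sketch using the standard output-size formula for a convolution (the helper name is mine, not from any framework):

```python
# Standard convolution output-size formula:
#   out = floor((in + 2*pad - kernel) / stride) + 1
def conv_output_size(in_size, kernel, pad=0, stride=1):
    return (in_size + 2 * pad - kernel) // stride + 1

# 32x32 input, 5x5 kernel:
print(conv_output_size(32, 5, pad=0))  # 28 -- the map shrinks every layer
print(conv_output_size(32, 5, pad=2))  # 32 -- "same" padding preserves the size
```

With no padding, a few stacked 5x5 convolutions would quickly eat away the feature map, which is exactly why padding is used to *control* output size rather than to "keep information".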

Pooling is of MUCH MORE IMPORTANCE in convnets. Pooling is not simply "down-sampling" or "losing spatial information". Consider first that the kernel computations happen before pooling, with full spatial information available. Pooling reduces the spatial dimensions but keeps (hopefully) the information the kernels have extracted. In doing so, it achieves one of the most interesting properties of convnets: robustness to displacement, rotation, or distortion of the input. A learnt feature is still detected even if it appears in a slightly different location or with some distortion. Pooling also implies learning across increasing scales, discovering (again, hopefully) hierarchical patterns at different scales. And, just as necessarily, pooling keeps the computation tractable as the number of layers grows.
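The translation robustness can be seen in a tiny NumPy sketch (`max_pool_2x2` is a hand-rolled helper for illustration, not a framework function): a strong activation shifted by one pixel can still land in the same pooled cell, so the pooled output is unchanged.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2-D feature map."""
    h, w = x.shape
    x = x[:h - h % 2, :w - w % 2]  # crop to even dimensions
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.zeros((4, 4))
fm[0, 0] = 1.0            # strong activation at (0, 0)

shifted = np.zeros((4, 4))
shifted[1, 1] = 1.0       # same activation, shifted by one pixel

print(max_pool_2x2(fm))        # 1.0 appears in pooled cell (0, 0)
print(max_pool_2x2(shifted))   # identical pooled output
```

This invariance only holds for shifts within a pooling window, which matches the question's framing of pooling as a trade-off between spatial precision and positional robustness.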


I have wondered about this question for a while too, and I have seen some papers mention this same issue. Here is a recent paper I found: Recombinator Networks: Learning Coarse-to-Fine Feature Aggregation. I have not fully read the paper, but it seems to address your question. I can update this answer as soon as I fully grasp the paper.

Joshua Owoyemi