I have been studying the UNet-inspired architecture ENet and I think I follow the basic concepts. The cornerstone of ENet's efficiency is the dilated convolution (among other things). I understand how it preserves spatial resolution and how it is computed, but I can't understand why it is computationally and memory-wise less expensive than e.g. max-pooling.
2 Answers
A dilated convolution layer simply lets you skip whole layers of computation:
For example, a dilated convolution with
- a filter kernel k×k = 3×3, dilation rate r = 2, stride s = 1, and no padding
is comparable to
- 2× downsampling, followed by a 3×3 convolution, followed by 2× upsampling
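Here is a minimal sketch of that comparison in PyTorch (my choice of framework; I use padding and nearest-neighbour upsampling so both paths map a 64×64 input to a 64×64 output, whereas the setup above assumes no padding):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 16, 64, 64)  # (batch, channels, height, width)

# One dilated convolution: the 3x3 taps are spread over a 5x5 window,
# so each output pixel sees the same spatial extent as the pipeline below,
# at full resolution and in a single layer.
dilated = nn.Conv2d(16, 16, kernel_size=3, dilation=2, padding=2)
y_dilated = dilated(x)

# The three-layer alternative: 2x downsample -> 3x3 conv -> 2x upsample.
conv = nn.Conv2d(16, 16, kernel_size=3, padding=1)
y_stack = F.interpolate(conv(F.max_pool2d(x, kernel_size=2)),
                        scale_factor=2, mode="nearest")

print(y_dilated.shape, y_stack.shape)  # both torch.Size([1, 16, 64, 64])
```

The dilated version runs one layer instead of three and never materializes the intermediate downsampled feature map, which is where the savings come from.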
For further reference, have a look at the excellent paper by Vincent Dumoulin and Francesco Visin: A guide to convolution arithmetic for deep learning.
The GitHub repository for this paper also has animations of how dilated convolution works: https://github.com/vdumoulin/conv_arithmetic

In addition to the accepted answer by @T1Berger, think of a situation where you want to capture larger features across many pixels without the loss of information caused by down-sampling. The traditional way to do this would be to use larger kernels in the convolution layers, but larger kernels are computationally expensive. With a dilated convolution layer, larger features can be extracted with fewer operations. This holds in frameworks where operations on sparse feature maps are optimized.
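To make "fewer operations" concrete, here is a rough sketch in PyTorch (my choice of framework, the answer doesn't name one). The helper `macs` is hypothetical; it counts multiply-accumulates per layer and ignores the bias. A dilated 3×3 kernel covers the same 5×5 receptive field as a dense 5×5 kernel, but with 9 weights instead of 25 per channel pair:

```python
import torch.nn as nn

# Same 5x5 receptive field, two different costs:
dilated = nn.Conv2d(16, 16, kernel_size=3, dilation=2, padding=2)  # 9 taps
dense = nn.Conv2d(16, 16, kernel_size=5, padding=2)                # 25 taps

def macs(conv, out_h=64, out_w=64):
    # multiply-accumulates = taps per output value * number of output values
    taps = conv.in_channels * conv.kernel_size[0] * conv.kernel_size[1]
    return taps * conv.out_channels * out_h * out_w

print(macs(dilated))  # 16*9*16*64*64  = 9,437,184
print(macs(dense))    # 16*25*16*64*64 = 26,214,400  (~2.8x more)
```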
