I have been studying the UNet-inspired architecture ENet and I think I follow the basic concepts. The cornerstone of ENet's efficiency is the dilated convolution (among other things). I understand how it preserves spatial resolution and how it is computed, but I can't understand why it is computationally and memory-wise less expensive than e.g. max-pooling.
2 Answers
A dilated convolution layer simply lets you skip whole layers of computation:
For example, a dilated convolution with
- a filter kernel k×k = 3×3, dilation rate r = 2, stride s = 1, and no padding
is comparable to
- 2× downsampling, followed by a 3×3 convolution, followed by 2× upsampling
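Here is a minimal sketch of that comparison in PyTorch (my choice of framework; I use padding and nearest-neighbour upsampling so both paths map a 64×64 input to a 64×64 output, whereas the setup above assumes no padding):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 16, 64, 64)  # (batch, channels, height, width)

# One dilated convolution: the 3x3 taps are spread over a 5x5 window,
# so each output pixel sees the same spatial extent as the pipeline below,
# at full resolution and in a single layer.
dilated = nn.Conv2d(16, 16, kernel_size=3, dilation=2, padding=2)
y_dilated = dilated(x)

# The three-layer alternative: 2x downsample -> 3x3 conv -> 2x upsample.
conv = nn.Conv2d(16, 16, kernel_size=3, padding=1)
y_stack = F.interpolate(conv(F.max_pool2d(x, kernel_size=2)),
                        scale_factor=2, mode="nearest")

print(y_dilated.shape, y_stack.shape)  # both torch.Size([1, 16, 64, 64])
```

The dilated version runs one layer instead of three and never materializes the intermediate downsampled feature map, which is where the savings come from.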
For further reference, have a look at the excellent paper by Vincent Dumoulin and Francesco Visin: A guide to convolution arithmetic for deep learning.
The GitHub repository for this paper also has animations of how dilated convolution works: https://github.com/vdumoulin/conv_arithmetic

In addition to the accepted answer by @T1Berger, think of a situation where you want to capture larger features across many pixels without the loss of information caused by down-sampling. The traditional way to do this would be to use larger kernels in the convolution layers, but larger kernels are computationally expensive. With a dilated convolution layer, larger features can be extracted with fewer operations. This holds in frameworks where operations on sparse feature maps are optimized.
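To make "fewer operations" concrete, here is a rough sketch in PyTorch (my choice of framework, the answer doesn't name one). The helper `macs` is hypothetical; it counts multiply-accumulates per layer and ignores the bias. A dilated 3×3 kernel covers the same 5×5 receptive field as a dense 5×5 kernel, but with 9 weights instead of 25 per channel pair:

```python
import torch.nn as nn

# Same 5x5 receptive field, two different costs:
dilated = nn.Conv2d(16, 16, kernel_size=3, dilation=2, padding=2)  # 9 taps
dense = nn.Conv2d(16, 16, kernel_size=5, padding=2)                # 25 taps

def macs(conv, out_h=64, out_w=64):
    # multiply-accumulates = taps per output value * number of output values
    taps = conv.in_channels * conv.kernel_size[0] * conv.kernel_size[1]
    return taps * conv.out_channels * out_h * out_w

print(macs(dilated))  # 16*9*16*64*64  = 9,437,184
print(macs(dense))    # 16*25*16*64*64 = 26,214,400  (~2.8x more)
```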
