
I have been studying the UNet-inspired architecture ENet, and I think I follow the basic concepts. A cornerstone of ENet's efficiency is dilated convolution (among other things). I understand how it preserves spatial resolution and how it is computed, but I can't understand why it is computationally and memory-wise less expensive than, e.g., max-pooling.

ENet: https://arxiv.org/pdf/1606.02147.pdf

Christoph Rackwitz

2 Answers


A dilated convolution layer lets you simply skip whole computational layers:

For example, a dilated convolution with

  • a filter kernel k×k = 3×3, dilation rate r = 2, stride s = 1 and no padding

is comparable to

  • 2× downsampling, followed by a 3×3 convolution, followed by 2× upsampling (sketched in code below)
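
Here is a minimal PyTorch sketch of the two paths (not from the original answer; the channel count and input size are made up, and padding is added so the output shapes line up for comparison):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 8, 32, 32)  # toy feature map: batch 1, 8 channels, 32x32

# Dilated 3x3 convolution (rate 2): covers a 5x5 receptive field,
# but still uses only 3x3 = 9 weights per input/output channel pair.
dilated = nn.Conv2d(8, 8, kernel_size=3, dilation=2, stride=1, padding=2)

# Roughly comparable pipeline: 2x downsampling -> 3x3 conv -> 2x upsampling.
conv = nn.Conv2d(8, 8, kernel_size=3, padding=1)

y_dilated = dilated(x)
y_pipeline = F.interpolate(conv(F.max_pool2d(x, 2)), scale_factor=2)

print(y_dilated.shape)   # torch.Size([1, 8, 32, 32])
print(y_pipeline.shape)  # torch.Size([1, 8, 32, 32])
```

Both paths produce a 32×32 map with 5×5-per-pixel context, but the dilated version does it in a single layer, with no intermediate down- and up-sampled tensors to compute and store.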

For further reference, see the excellent paper by Vincent Dumoulin and Francesco Visin: A guide to convolution arithmetic for deep learning

The paper's GitHub repository also contains animations of how dilated convolution works: https://github.com/vdumoulin/conv_arithmetic

T1Berger

In addition to the accepted answer by @T1Berger: think of a situation where you want to capture larger features spanning many pixels without down-sampling, which causes loss of information. The traditional way to do this is to use larger kernels in the convolution layers, but larger kernels are computationally expensive. With a dilated convolution layer, larger features can be extracted with fewer operations, at least in frameworks where the operations on sparse feature maps are optimized. A rough operation count is sketched below.
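
A back-of-the-envelope comparison (the kernel sizes are chosen for illustration, not taken from the answer): a 3×3 kernel with dilation rate 3 covers the same 7×7 receptive field as a dense 7×7 kernel, but with far fewer multiply-accumulates per output pixel:

```python
# Per-output-pixel multiply-accumulates, single channel, stride 1.
k_dense = 7                       # dense kernel covering a 7x7 receptive field
macs_dense = k_dense * k_dense    # 49 MACs

k, r = 3, 3                       # 3x3 kernel with dilation rate 3
field = k + (k - 1) * (r - 1)     # effective receptive field: 3 + 2*2 = 7
macs_dilated = k * k              # 9 MACs for the same 7x7 field

print(field, macs_dense, macs_dilated)  # 7 49 9
```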

YScharf