
I'm exploring and learning the domain of Computer Vision and am currently learning about CNNs. I fully understand the concept of CNNs, i.e., up to the fully connected layer.

But, when I dived into the task of image segmentation I came across the following papers:

  • Learning Deconvolution Network for Semantic Segmentation
  • Fully Convolutional Networks for Semantic Segmentation
  • U-Net: Convolutional Networks for Biomedical Image Segmentation

Here they talk about convolution and fully connected layers followed by deconvolution and un-pooling. I understand the mathematical aspect of deconvolution and un-pooling, but I'm unable to understand, and more importantly visualize, how they eventually lead to image segmentation.

Pratham Solanki

1 Answer


Our goal: The task of image segmentation requires your output to have the same spatial dimensionality as your input image (but with class labels instead of pixel colors). You can think of it as many classification tasks, one per input pixel.
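To make "one classification per pixel" concrete, here is a toy numpy sketch (not from the answer; the shapes and class count are made up for illustration): the network's final activation holds one score per class at every pixel, and taking the argmax over the class axis yields the label map.

```python
import numpy as np

np.random.seed(0)

# Hypothetical network output for a 4x4 image and 3 classes:
# one score (logit) per class at every pixel location.
logits = np.random.rand(4, 4, 3)     # shape (H, W, num_classes)

# Segmentation = per-pixel classification:
# pick the highest-scoring class independently at each pixel.
label_map = logits.argmax(axis=-1)   # shape (H, W), values in {0, 1, 2}
print(label_map.shape)  # (4, 4)
```

The label map has the same height/width as the input, which is exactly the output shape segmentation needs.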

A typical classification CNN consists of a series of convolution/pooling layers, followed by dense layers that eventually map the image to your 'label space'. This cannot work for segmentation, because the dense layers flatten the features and throw away their spatial arrangement.

A Fully Convolutional Network is one that maps an image to another image (with an arbitrary number of channels) that is scaled down by some factor (depending on the pooling steps that were used).

If you avoid any pooling, your output will have the same height/width as your input (which is our goal). However, we do want to reduce the spatial size of the activations, because: a) it is much more computationally efficient (allowing us to go deeper), and b) it helps propagate information across different scales.
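A minimal numpy sketch of how pooling shrinks the activations (an illustration, not code from any of the papers): each stride-2 max-pooling step halves the height and width, so n such steps scale the map down by 2^n.

```python
import numpy as np

def max_pool2d(x, size=2):
    """Naive 2x2 max pooling: keeps the largest value in each
    non-overlapping size x size block, halving height and width."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]  # drop any ragged edge
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
y = max_pool2d(x)
print(y.shape)  # (2, 2) -- spatial size halved
# After n pooling steps the map is smaller by a factor of 2**n,
# e.g. five steps turn a 256x256 input into an 8x8 map.
```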

So we want to reduce the activations in size and then upsample them back to the original size. This is where deconvolutions (also called transposed convolutions) come into play.
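To help visualize the upsampling step, here is a toy transposed convolution in numpy (a sketch for intuition, not the papers' implementation): each input value scatters a scaled copy of the kernel into a larger output grid, so a stride of 2 roughly doubles the spatial size.

```python
import numpy as np

def conv_transpose2d(x, kernel, stride=2):
    """Naive transposed convolution ('deconvolution'):
    every input value stamps kernel * value into the output,
    with stamps placed `stride` pixels apart.
    Output size: (in - 1) * stride + kernel_size."""
    h, w = x.shape
    k = kernel.shape[0]
    out = np.zeros(((h - 1) * stride + k, (w - 1) * stride + k))
    for i in range(h):
        for j in range(w):
            out[i*stride:i*stride+k, j*stride:j*stride+k] += x[i, j] * kernel
    return out

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # a coarse 2x2 activation map
kernel = np.ones((2, 2))     # trivial kernel; a trained one would be learned
y = conv_transpose2d(x, kernel, stride=2)
print(y.shape)  # (4, 4) -- upsampled back toward the input resolution
```

Note this is exactly the inverse direction of pooling/strided convolution: instead of many input pixels mapping to one output value, one input value spreads over many output pixels.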

U-Net is a popular architecture that does the above and uses another critical concept: each time you upsample, you combine (usually either add or concatenate; the original U-Net concatenates) the upsampled activations with activations from the earlier layers of the same spatial size. This allows your network to retain the fine details that would otherwise be lost (imagine what result you would get if you had to upsample your segmentation by a factor of 16 or possibly more).

Additionally, these connections have a secondary (but important) benefit: better gradient propagation. They act similarly to the skip connections in ResNet.

Mark Loyman