
The input image size of u-net is 572*572, but the output mask size is 388*388. How could the image get masked with a smaller mask?


1 Answer


Probably you are referring to the scientific paper by Ronneberger et al., in which the U-Net architecture was published. The figure there shows these numbers.

U-Net architecture
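As a quick sanity check, the sizes in that figure can be traced with a few lines (a sketch; the layer counts are taken from the paper's figure: two unpadded 3x3 convolutions per level, 2x2 max pooling in the contracting path, 2x2 up-convolutions in the expanding path):

```python
# Trace the spatial size through the U-Net of Ronneberger et al.
# Each unpadded 3x3 convolution removes 2 pixels (one per side),
# and each level applies two of them.
size = 572
for _ in range(4):             # contracting path
    size = (size - 4) // 2     # two 3x3 valid convs, then 2x2 max pooling
size -= 4                      # bottleneck: two more 3x3 valid convs
for _ in range(4):             # expanding path
    size = size * 2 - 4        # 2x2 up-conv doubles, two 3x3 valid convs shrink
print(size)  # 388
```

So the 572 -> 388 shrink follows directly from the unpadded convolutions.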

The explanation is a bit hidden in section "3. Training" of the paper:

Due to the unpadded convolutions, the output image is smaller than the input by a constant border width.

This means that during each convolution, part of the image is "cropped", since the convolution only starts at coordinates where the kernel fully overlaps the input image / input blob of the layer. In the case of 3x3 convolutions, this is always one pixel at each side. For a more visual explanation of kernels/convolutions see e.g. here. The output is smaller because, due to the cropping occurring during unpadded convolutions, only the inner part of the image gets a result.
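This shrinking is easy to reproduce, e.g. in NumPy (a sketch: a sliding window that must fully fit inside the image behaves exactly like an unpadded, "valid" convolution; the box-blur kernel is just a stand-in for a learned one):

```python
import numpy as np

img = np.random.rand(572, 572)
kernel = np.ones((3, 3)) / 9.0  # any 3x3 kernel; a box blur here

# An unpadded ("valid") convolution: the kernel only visits positions
# where it fully overlaps the image, so one pixel is lost per side.
windows = np.lib.stride_tricks.sliding_window_view(img, (3, 3))
out = np.einsum("ijkl,kl->ij", windows, kernel)
print(out.shape)  # (570, 570)
```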

It is not a general characteristic of the architecture, but something inherent to (unpadded) convolutions, and it can be avoided with padding. Probably the most common strategy is mirroring at the image borders, so that each convolution can start at the very edge of the image (and sees mirrored pixels in places where its kernel overlaps the border). Then the input size is preserved and the full image gets segmented.
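For example, with mirror ("reflect") padding in NumPy, the same 3x3 valid convolution preserves the input size (a sketch; the kernel is again just a stand-in for a learned one):

```python
import numpy as np

img = np.random.rand(572, 572)
kernel = np.ones((3, 3)) / 9.0  # stand-in for a learned 3x3 kernel

# Mirror one pixel on each border, then run the same "valid"
# convolution: the kernel can now start at the very edge of the
# original image, so the output size matches the input size.
padded = np.pad(img, 1, mode="reflect")                       # 574 x 574
windows = np.lib.stride_tricks.sliding_window_view(padded, (3, 3))
out = np.einsum("ijkl,kl->ij", windows, kernel)
print(out.shape)  # (572, 572)
```

Deep-learning frameworks offer the same thing as a padding option on their convolution layers.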

Honeybear
  • So is there a post-processing phase after the U-Net architecture described in the original paper? Since the output is of shape (388,388,2), some mapping to the original 572x572 shape is needed, right? – OmriKaduri Sep 17 '19 at 15:33
  • Yes and no: As described, the smaller shape is due to cropping. The only sensible postprocessing/mapping to the original shape is cropping the original image to the size of the output (**there is only a segmentation prediction for a center part of the image**). As already stated: You can avoid this (and get a full 572x572 prediction) with padded convolutions. Common NN implementations (Caffe, Tensorflow, ...) offer these out of the box. – Honeybear Sep 24 '19 at 10:02
  • Thank you very much for your answer. Isn't it odd that he didn't mention in the paper that he's predicting only the center part of the image? Or have I missed it? – OmriKaduri Sep 26 '19 at 08:03
  • Yes, when I read it the first time I was confused, too. But in a lot of papers varying knowledge is assumed as a given prerequisite, and explanations are therefore missing. Ronneberger apparently assumed that the given information is enough to derive the fact that he's cropping / predicting only the center part. – Honeybear Sep 30 '19 at 13:54
  • Are there any downsides to (zero-)padded convolutions in comparison to "normal" convolutions? In other words: should I build my U-Net model according to the scientific paper, or can I go with zero-padded convs? The latter is used in many implementations I have found on GitHub. – Johannes Schmidt Oct 07 '19 at 15:27
  • That's a question of its own ;) It's already been asked, e.g. on [datascience-exchange](https://datascience.stackexchange.com/questions/13906/what-are-the-pros-and-cons-of-zero-padding-in-a-convolution-layer?rq=1) and [stackoverflow itself](https://stackoverflow.com/questions/44960987/when-to-use-what-type-of-padding-for-convolution-layers). – Honeybear Oct 09 '19 at 00:05