Our goal: The task of image segmentation requires your output to have the same spatial dimensions as your input images (but with a class label at each position instead of a pixel color). You can think of it as one classification task per input pixel.
A typical classification CNN consists of a series of convolution/pooling layers, followed by dense layers that eventually map the image to your 'label space'. This cannot work for segmentation: the dense layers flatten the activations into a fixed-size vector, discarding the spatial layout we need to keep.
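To see the problem concretely, here is a minimal sketch (PyTorch; the layer sizes are made up for illustration, not from any particular architecture) of a classifier head collapsing the image into a single label vector:

```python
import torch
import torch.nn as nn

# Hypothetical classifier: convolutions/pooling, then a dense head.
classifier = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                 # halves height/width
    nn.Flatten(),                    # spatial layout is lost here
    nn.Linear(16 * 112 * 112, 10),   # fixed input size, one vector out
)

x = torch.randn(1, 3, 224, 224)
print(classifier(x).shape)  # torch.Size([1, 10]) -- one label, not one per pixel
```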
A Fully Convolutional Network (FCN) is one that maps an image to another image (with an arbitrary number of channels) that is downscaled by some factor (determined by the pooling steps that were used).
If you avoid any pooling, your output will have the same height/width as your input (which is our goal). However, we do want to reduce the spatial size of the activations, because: a) it is much more computationally efficient (allowing us to go deeper); b) it grows the receptive field, which helps propagate information across different scales.
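Here is a minimal fully convolutional sketch (again PyTorch, with made-up layer sizes and an assumed class count): no dense layers, so the output keeps a spatial layout, just downscaled by the total pooling factor:

```python
import torch
import torch.nn as nn

num_classes = 21  # assumed, e.g. PASCAL VOC
fcn = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # /2
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # /4
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # /8
    nn.Conv2d(128, num_classes, 1),  # 1x1 conv: class scores at every location
)

x = torch.randn(1, 3, 224, 224)
print(fcn(x).shape)  # torch.Size([1, 21, 28, 28]) -- downscaled by a factor of 8
```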
So we want to reduce the activations in size, and then upsample them back to the original size. This is where deconvolutions (more accurately called transposed convolutions) come into play.
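For example, a transposed convolution can double the height/width of the coarse score map from the previous sketch (the kernel size and stride below are one common choice, not canonical):

```python
import torch
import torch.nn as nn

coarse = torch.randn(1, 21, 28, 28)  # e.g. the /8 score map from above
up = nn.ConvTranspose2d(21, 21, kernel_size=2, stride=2)  # doubles height/width
print(up(coarse).shape)  # torch.Size([1, 21, 56, 56])
# Stacking three such layers (or using one with stride 8) restores 224x224.
```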
U-Net is a popular architecture that does the above and uses another critical concept: each time you upsample, you combine (usually by adding or concatenating; the original U-Net concatenates) the upsampled activations with activations from earlier layers of the same size. This allows your network to retain the fine details that would otherwise be lost (imagine the result you would get if you had to upsample your segmentation by a factor of 16 or more).
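Here is a rough sketch of one such decoder step (the channel counts are assumptions; the concatenation mirrors what the original U-Net does):

```python
import torch
import torch.nn as nn

decoder_feats = torch.randn(1, 128, 28, 28)  # coarse, deep activations
encoder_feats = torch.randn(1, 64, 56, 56)   # earlier activations at the target scale

up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
x = up(decoder_feats)                     # -> (1, 64, 56, 56)
x = torch.cat([x, encoder_feats], dim=1)  # -> (1, 128, 56, 56): fine details kept
x = nn.Conv2d(128, 64, 3, padding=1)(x)   # fuse the two sources
print(x.shape)  # torch.Size([1, 64, 56, 56])
```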
Additionally, these connections have a secondary (but important) benefit: better gradient propagation. They act similarly to the skip connections in ResNet.