
In deep learning, I have seen many papers apply normalization as a pre-processing step: the input is normalized to zero mean and unit variance before being fed to a convolutional network (which already contains BatchNorm layers). Why not use the original intensities? What is the benefit of the normalization step? And if I apply histogram matching across images, should I still use the normalization step? Thanks

Jame

2 Answers


Normalization is important to bring the features onto the same scale so that the network behaves much better. Assume there are two features, one measured on a scale of 1 to 10 and the other on a scale of 1 to 10,000. In terms of the squared error function, the network will be busy optimizing the weights according to the much larger error contributed by the second feature.

Therefore it is better to normalize.
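
For concreteness, here is a minimal sketch of the zero-mean, unit-variance preprocessing discussed here (the function name, batch shape, and epsilon are illustrative assumptions, not taken from any particular framework):

```python
import numpy as np

def standardize(images: np.ndarray) -> np.ndarray:
    """Rescale a batch of images to zero mean and unit variance per channel."""
    images = images.astype(np.float32)
    mean = images.mean(axis=(0, 1, 2), keepdims=True)  # per-channel mean
    std = images.std(axis=(0, 1, 2), keepdims=True)    # per-channel std
    return (images - mean) / (std + 1e-7)              # epsilon guards against division by zero

# Example: a batch of 8-bit RGB images with raw intensities in [0, 255]
batch = np.random.randint(0, 256, size=(16, 32, 32, 3))
normalized = standardize(batch)
print(normalized.mean(), normalized.std())  # approximately 0.0 and 1.0
```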

Suleiman
  • Yes, I should think so, because from my understanding histogram matching will, for instance, balance out the intensities of your image so that all the pixels fall in a range like [100, 255], but the values you feed to the network should still be normalized to the range [0, 1] – Suleiman Jan 30 '19 at 03:16
  • So why should it be [0, 1]? We can think of [100, 255] as just another scale range, so histogram matching behaves like normalization – Jame Jan 30 '19 at 03:18
  • [0, 1] is usually used because such values are easier to deal with, and for image classification you divide by 255 since 255 is the maximum pixel value (see the short sketch below this comment thread). Check this [link](https://stackoverflow.com/questions/20486700/why-we-always-divide-rgb-values-by-255) for more details – Suleiman Jan 30 '19 at 03:42
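
To make the comments above concrete, here is a minimal sketch of the divide-by-255 scaling and the optional standardization discussed in this thread (the pixel range and image size are only illustrative):

```python
import numpy as np

# Illustrative image: after histogram matching, pixel values lie in [100, 255]
img = np.random.randint(100, 256, size=(32, 32)).astype(np.float32)

# Common scaling for 8-bit images: divide by the maximum possible pixel value
scaled = img / 255.0  # values now lie roughly in [0.39, 1.0]

# Optional further step discussed above: standardize to zero mean and unit variance
standardized = (scaled - scaled.mean()) / (scaled.std() + 1e-7)

print(scaled.min(), scaled.max(), standardized.mean(), standardized.std())
```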

The answer to this can be found in Andrew Ng's tutorial: https://youtu.be/UIp2CMI0748?t=133.

TL;DR: If you do not normalize the input features, they can end up on very different scales, which slows down Gradient Descent.

Long explanation: Let us consider a model that uses two features Feature1 and Feature2 with the following ranges:

Feature1: [10, 10000]
Feature2: [0.00001, 0.001]

The contour plot of the loss over the weights for these two features will look something like this (scaled for easier visibility):
[Figure: Contour plot for Feature1 and Feature2]

When you perform Gradient Descent, you calculate d(Weight1) and d(Weight2), where "d" denotes the partial derivative of the loss, in order to move the model weights closer to the loss minimum. Because Feature1 takes much larger values than Feature2, d(Weight1) is going to be significantly larger than d(Weight2), which is what the elongated contours above show. So even if you choose a reasonably moderate learning rate, the updates will zig-zag back and forth along the steep Weight1 direction while making very little progress along Weight2, and you may even overshoot and miss the minimum.
[Figure: Gradient Descent path with a medium learning rate]

To avoid this you could choose a very small learning rate, but then Gradient Descent will take a very long time to converge, and you may stop training before ever reaching the minimum.
[Figure: Gradient Descent path with a very small learning rate]

So, as you can see from the examples above, not scaling your features leads to inefficient Gradient Descent, which can prevent you from finding the optimal model.
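
To illustrate this numerically, here is a small sketch that runs the same plain Gradient Descent on raw and on standardized versions of two such features (the target weights, learning rates, and step counts are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
feature1 = rng.uniform(10, 10_000, n)   # large-scale feature, range [10, 10000]
feature2 = rng.uniform(1e-5, 1e-3, n)   # tiny-scale feature, range [0.00001, 0.001]
X_raw = np.column_stack([feature1, feature2])
true_w = np.array([0.5, 5e6])           # illustrative target weights
y = X_raw @ true_w + rng.normal(0, 1, n)

def gradient_descent_mse(X, y, lr, steps=1000):
    """Plain batch gradient descent on mean squared error; returns the final loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return np.mean((X @ w - y) ** 2)

# Standardize each column to zero mean and unit variance (and center the target)
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
y_centered = y - y.mean()

# On the raw features only a tiny learning rate is stable (much larger values diverge),
# so the weight of the small-scale feature barely moves and the loss stays high.
print("raw features,    lr=1e-9:", gradient_descent_mse(X_raw, y, lr=1e-9))

# On the standardized features a moderate learning rate converges quickly to the noise level.
print("scaled features, lr=0.1 :", gradient_descent_mse(X_std, y_centered, lr=0.1))
```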

Viman Deb