
I am writing a simple gradient descent implementation for linear regression on a multivariable data set. While testing the code I noticed that the cost was still decreasing after 5 million iterations, which suggests my learning rate is too small. When I tried to increase it, I got an overflow in the cost value. After I normalized the data the problem was solved: I could increase the learning rate without getting any error. I was wondering what the relation is between normalization and overflow of the cost.

[Plot: gradient descent without normalization (small learning rate)]

[Plot: data without normalization (bigger learning rate)]

[Plot: normalized data with a big learning rate]

[Plots: data before normalization / data after normalization]
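Since the screenshots are not reproducible, here is a minimal sketch of the situation (with made-up data and a made-up learning rate, not the original code): the same learning rate that makes the cost overflow on raw features converges once the features are min-max normalized.

```python
import numpy as np

# Hypothetical data: two features on very different scales
rng = np.random.default_rng(0)
X = np.c_[rng.uniform(0, 1, 100), rng.uniform(0, 5000, 100)]
y = X @ np.array([2.0, 0.5]) + 1.0

def gradient_descent(X, y, lr, iters):
    Xb = np.c_[np.ones(len(X)), X]              # add bias column
    theta = np.zeros(Xb.shape[1])
    for _ in range(iters):
        grad = Xb.T @ (Xb @ theta - y) / len(y)  # gradient of MSE cost
        theta -= lr * grad
    cost = np.mean((Xb @ theta - y) ** 2) / 2
    return theta, cost

# On raw data this learning rate diverges: theta blows up and the
# cost overflows to inf/nan within a few dozen iterations.
with np.errstate(over="ignore", invalid="ignore"):
    _, cost_raw = gradient_descent(X, y, lr=1e-3, iters=200)

# Same learning rate after min-max normalization: the cost stays finite
# and keeps decreasing.
Xn = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
_, cost_norm = gradient_descent(Xn, y, lr=1e-3, iters=200)
```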

  • Please do **not** post screenshots of code - see how to create a [mre]. – desertnaut Jul 27 '22 at 07:42
  • And this is not a programming problem, this concept is covered in any basic neural networks course. – Dr. Snoopy Jul 27 '22 at 07:49
  • Thanks for the advice, I'll try to improve the presentation of my questions. I think I get how normalization helps make learning faster, but I didn't get why skipping normalization causes overflow. – Karim Ahmed Jul 27 '22 at 08:15

1 Answer


Basically, normalizing the inputs gives the surface of the function you want to optimize a more spherical shape. Without normalization, differences in the scale of the variables make the surface more ellipsoidal.

Now you could ask: why does spherical vs. ellipsoidal matter?
Gradient descent is a first-order method, so it does not consider the curvature of the surface when choosing the direction of each step. An ellipsoidal surface (more irregular curvature) can therefore cause trouble with convergence, and with it the overflow, especially if you set a large learning rate (the algorithm takes bigger steps at each iteration). I think it is easiest to understand by looking at a 2D plot example: with a spherical surface the gradient points at the minimum, which makes learning easier.
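To make this concrete, here is a small sketch (made-up data; the `2 / lambda_max` stability bound applies to batch gradient descent on the MSE cost) showing how normalization shrinks the spread of the Hessian's eigenvalues, i.e. makes the surface less ellipsoidal, and thereby raises the largest learning rate that does not diverge:

```python
import numpy as np

# Hypothetical two-feature design matrix with very different scales
rng = np.random.default_rng(1)
X = np.c_[rng.uniform(0, 1, 200), rng.uniform(0, 1000, 200)]

def curvature_stats(X):
    # Hessian of the MSE cost: H = X^T X / n. The ratio of its largest
    # to smallest eigenvalue (condition number) measures how ellipsoidal
    # the cost surface is; gradient descent diverges once lr > 2/lambda_max.
    H = X.T @ X / len(X)
    eig = np.linalg.eigvalsh(H)
    return eig.max() / eig.min(), 2.0 / eig.max()

cond_raw, lr_max_raw = curvature_stats(X)

# After min-max normalization both features live on [0, 1]
Xn = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
cond_norm, lr_max_norm = curvature_stats(Xn)

# Normalization shrinks the condition number by orders of magnitude,
# so a much larger learning rate still stays below the divergence
# threshold, and the cost no longer blows up to overflow.
```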

puigde