
I am writing a simple gradient descent implementation for linear regression on a multivariable data set. While testing the code I noticed that the cost was still decreasing after 5 million iterations, which suggests my learning rate is too small. When I tried to increase it, I got an overflow in the cost value. After I normalized the data the problem was solved: I could increase the learning rate without getting any error. I was wondering what the relation is between normalization and overflow of the cost.

[Plot: gradient descent without normalization (small learning rate)]

[Plot: data without normalization (bigger learning rate)]

[Plot: normalized data with a big learning rate]

[Plots: data before normalization / data after normalization]
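Since the screenshots are not reproducible, here is a minimal sketch of the situation (with made-up data and a made-up learning rate, not the original code): the same learning rate that makes the cost overflow on raw features converges once the features are min-max normalized.

```python
import numpy as np

# Hypothetical data: two features on very different scales
rng = np.random.default_rng(0)
X = np.c_[rng.uniform(0, 1, 100), rng.uniform(0, 5000, 100)]
y = X @ np.array([2.0, 0.5]) + 1.0

def gradient_descent(X, y, lr, iters):
    Xb = np.c_[np.ones(len(X)), X]              # add bias column
    theta = np.zeros(Xb.shape[1])
    for _ in range(iters):
        grad = Xb.T @ (Xb @ theta - y) / len(y)  # gradient of MSE cost
        theta -= lr * grad
    cost = np.mean((Xb @ theta - y) ** 2) / 2
    return theta, cost

# On raw data this learning rate diverges: theta blows up and the
# cost overflows to inf/nan within a few dozen iterations.
with np.errstate(over="ignore", invalid="ignore"):
    _, cost_raw = gradient_descent(X, y, lr=1e-3, iters=200)

# Same learning rate after min-max normalization: the cost stays finite
# and keeps decreasing.
Xn = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
_, cost_norm = gradient_descent(Xn, y, lr=1e-3, iters=200)
```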

  • Please do **not** post screenshots of code - see how to create a [mre]. – desertnaut Jul 27 '22 at 07:42
  • And this is not a programming problem, this concept is covered in any basic neural networks course. – Dr. Snoopy Jul 27 '22 at 07:49
  • Thanks for the advice, I'll try to improve the presentation of my questions. I think I get how normalization helps make learning faster, but I didn't get why skipping normalization causes overflow. – Karim Ahmed Jul 27 '22 at 08:15

1 Answer


Basically, normalizing the inputs gives the surface of the function you want to optimize a more spherical shape. Without normalization, differences in the scale of the variables make the surface more ellipsoidal.

Now you could ask: why does spherical vs. ellipsoidal matter?
Gradient descent is a first-order method, so it does not consider the curvature of the surface when choosing the direction of each step. An ellipsoidal surface (more irregular curvature) can therefore cause trouble with convergence, and with it the overflow, especially if you set a large learning rate (the algorithm takes bigger steps at each iteration). I think it is easiest to understand by looking at a 2D plot example: with a spherical surface the gradient points at the minimum, which makes learning easier.
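To make this concrete, here is a small sketch (made-up data; the `2 / lambda_max` stability bound applies to batch gradient descent on the MSE cost) showing how normalization shrinks the spread of the Hessian's eigenvalues, i.e. makes the surface less ellipsoidal, and thereby raises the largest learning rate that does not diverge:

```python
import numpy as np

# Hypothetical two-feature design matrix with very different scales
rng = np.random.default_rng(1)
X = np.c_[rng.uniform(0, 1, 200), rng.uniform(0, 1000, 200)]

def curvature_stats(X):
    # Hessian of the MSE cost: H = X^T X / n. The ratio of its largest
    # to smallest eigenvalue (condition number) measures how ellipsoidal
    # the cost surface is; gradient descent diverges once lr > 2/lambda_max.
    H = X.T @ X / len(X)
    eig = np.linalg.eigvalsh(H)
    return eig.max() / eig.min(), 2.0 / eig.max()

cond_raw, lr_max_raw = curvature_stats(X)

# After min-max normalization both features live on [0, 1]
Xn = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
cond_norm, lr_max_norm = curvature_stats(Xn)

# Normalization shrinks the condition number by orders of magnitude,
# so a much larger learning rate still stays below the divergence
# threshold, and the cost no longer blows up to overflow.
```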

puigde