I have a few questions about the theory behind gradient descent in neural networks.
First question: Let's say we have 5 weights, one for each of 5 features, and we want to compute the gradient. How does the algorithm do this internally? Does it take the first weight (W1), try increasing (or decreasing) it a bit, and then move on to the second weight once it is done? Or does it work differently and more efficiently, by changing more than one weight at the same time?
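To make the first question concrete, here is a minimal Python sketch of the one-weight-at-a-time procedure I have in mind. The linear model, the squared-error loss, the example numbers, and the names `loss` and `numeric_gradient` are all my own assumptions for illustration; I am not claiming this is what any library actually does internally.

```python
def loss(weights, features, target):
    """Squared error of a linear model: (w . x - y)^2 (assumed toy loss)."""
    prediction = sum(w * x for w, x in zip(weights, features))
    return (prediction - target) ** 2


def numeric_gradient(weights, features, target, eps=1e-6):
    """Estimate dLoss/dW_i by nudging one weight at a time (central difference)."""
    grad = []
    for i in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[i] += eps      # increase only weight i a bit
        down[i] -= eps    # decrease only weight i a bit
        slope = (loss(up, features, target) - loss(down, features, target)) / (2 * eps)
        grad.append(slope)
    return grad


# Hypothetical values: 5 weights, one per feature.
weights = [0.1, -0.2, 0.3, 0.0, 0.5]
features = [1.0, 2.0, 3.0, 4.0, 5.0]
target = 2.0

grad = numeric_gradient(weights, features, target)
print(grad)
```

Note that this needs two loss evaluations per weight, so the cost grows with the number of weights. That is why I suspect the real algorithm does something smarter.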
Second question: If feature 1 is far more important than feature 2, so that the same relative (percentage) change in W1 has a bigger effect on the loss than the same change in W2, wouldn't it be better to have a different learning rate for each weight? If we have only one learning rate, we have to set it based on the most impactful weight alone, right?
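To illustrate the second question, here is a toy Python sketch. It assumes a made-up loss, 0.5 * (100 * W1^2 + 1 * W2^2), in which the loss is 100 times more sensitive to W1 than to W2. The `SENSITIVITY` values, the learning rates, and the function names are hypothetical choices of mine, only meant to show the situation I am describing.

```python
SENSITIVITY = [100.0, 1.0]  # assumed: loss reacts 100x more strongly to W1 than to W2


def gradient(weights):
    """Gradient of the toy loss 0.5 * sum(s_i * w_i^2)."""
    return [s * w for s, w in zip(SENSITIVITY, weights)]


def descend(weights, learning_rates, steps=20):
    """Plain gradient descent with one learning rate per weight."""
    weights = list(weights)
    for _ in range(steps):
        grad = gradient(weights)
        weights = [w - lr * g for w, lr, g in zip(weights, learning_rates, grad)]
    return weights


start = [1.0, 1.0]
shared = descend(start, [0.005, 0.005])    # one rate, kept small so W1 stays stable
per_weight = descend(start, [0.005, 0.5])  # a separate, larger rate for W2

print("shared rate:     ", shared)
print("per-weight rates:", per_weight)
```

In this toy run, the shared rate drives W1 close to zero after 20 steps while W2 is still around 0.9, whereas the per-weight rates bring both close to zero. This is the effect I am asking about.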