I have a few questions regarding the theory behind gradient descent in neural networks.

First question: Let's say we have 5 weights, one for each of 5 features, and we want to compute the gradient. How does the algorithm do it internally? Does it take the first weight (W1), try increasing (or decreasing) it a bit, and then move on to the second weight? Or does it work differently and more efficiently, changing more than one weight at the same time?

Second question: If feature 1 is far more important than feature 2, so the same change (in %) of W1 has a bigger effect on loss compared to W2, isn't it better to have a different learning rate for each weight? If we have only one learning rate, we set it by taking into account only the most impactful weight, right?

desertnaut
Solmyros

1 Answer

For question 1:

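It just does gradient descent. You don't wiggle the weights one at a time: you stack your weights in a vector/matrix/tensor W and compute an increment delta_W, which is itself a vector/matrix/tensor of the same shape. In a neural network that increment comes from the gradient of the loss, which backpropagation computes for all weights in a single pass. Once you know this increment, you apply it to all weights at once.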
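To make this concrete, here is a minimal sketch, assuming a plain linear model with 5 weights and a mean-squared-error loss (the data, `true_w`, and the learning rate are made up for illustration). It computes the whole gradient in one vector expression, checks it against the "wiggle one weight at a time" approach from the question, and then updates all weights together.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # 100 samples, 5 features (synthetic)
true_w = np.array([3.0, -1.0, 0.5, 2.0, 0.0])    # hypothetical "true" weights
y = X @ true_w

def loss(w):
    # Mean squared error of the linear model X @ w.
    return np.mean((X @ w - y) ** 2)

def gradient(w):
    # Analytic gradient: all 5 partial derivatives from one expression.
    return 2.0 * X.T @ (X @ w - y) / len(y)

def wiggle_gradient(w, eps=1e-6):
    # The one-weight-at-a-time idea, shown only for comparison.
    # It needs 2 loss evaluations per weight, so it scales badly.
    g = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (loss(w + step) - loss(w - step)) / (2 * eps)
    return g

w = np.zeros(5)
grads_match = np.allclose(gradient(w), wiggle_gradient(w), atol=1e-4)

lr = 0.1
for _ in range(200):
    w = w - lr * gradient(w)   # delta_W is applied to every weight at once

final_loss = loss(w)
```

Both approaches give the same numbers here, but the analytic version costs roughly one pass regardless of how many weights there are, which is why real networks rely on backpropagation rather than perturbing weights individually.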

For question 2:

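There are already many algorithms that adapt the learning rate per parameter; see for example RMSprop and Adam. Roughly speaking, they scale each parameter's step by a running estimate of the size of its recent gradients, which reflects how frequently and how strongly that parameter is involved in the updates.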
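As a rough illustration of per-parameter scaling, here is a sketch of the RMSprop update as it is commonly described. The function name `rmsprop_step`, the hyperparameter values, and the example gradients are assumptions chosen for the demo; library implementations differ in details.

```python
import numpy as np

def rmsprop_step(w, grad, avg_sq, lr=0.01, decay=0.9, eps=1e-8):
    # Running average of squared gradients, tracked separately per weight.
    avg_sq = decay * avg_sq + (1.0 - decay) * grad ** 2
    # Each weight's step is divided by its own recent gradient scale.
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)
    return w, avg_sq

# Hypothetical gradients: the first is 100x larger than the second.
w = np.zeros(2)
avg_sq = np.zeros(2)
grad = np.array([10.0, 0.1])

new_w, avg_sq = rmsprop_step(w, grad, avg_sq)
step_sizes = np.abs(new_w - w)
```

In this sketch both weights end up moving by about the same amount on the first step, even though their raw gradients differ by a factor of 100. That normalization is the sense in which these methods give each parameter its own effective learning rate.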

Regarding the "importance" that you describe:

so the same change (in %) of W1 has a bigger effect on loss compared to W2, isn't it better to have a different learning rate for each weight

You are just describing the gradient! In that case W1 has a larger gradient than W2, so it is already being updated with a larger step, so to speak. It wouldn't make much sense, though, to play around with its learning rate independently unless you have more information about its role (e.g. the frequency mentioned above).
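A tiny worked example of that point, assuming a made-up two-weight loss in which the first weight has ten times the influence of the second (the coefficients 10 and 1, the target 5, and `lr` are all hypothetical):

```python
import numpy as np

def gradient(w):
    # Gradient of the hypothetical loss (10*w[0] + 1*w[1] - 5)**2.
    residual = 10.0 * w[0] + 1.0 * w[1] - 5.0
    return np.array([2.0 * residual * 10.0, 2.0 * residual * 1.0])

w = np.array([0.0, 0.0])
g = gradient(w)       # the component for w[0] is 10x the one for w[1]

lr = 0.001            # one shared learning rate
update = -lr * g      # the step for w[0] is still 10x larger
```

With a single learning rate, the more influential weight already receives the larger update, because the learning rate multiplies a gradient that is larger for that weight.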

Ash