In the example below, gradient descent finds the correct slope (m) but completely misses the intercept (b), which always comes out near zero unless I give b a learning rate 1000x larger than m's.
Why does this happen? Do different types of parameters need different learning rates?
Example result without the 1000x learning rate for b:
m=3.1509653303 b=0.0360896063255
Example result with the 1000x learning rate for b:
m=3.14160584013 b=6.27263311371
What's going on?
N = 1000
# y = 3.14159 * x + 3.14159 * 2, so the true slope is pi and the true intercept is 2*pi
data = [x * 3.14159 + 3.14159 * 2 for x in xrange(N)]
m_param = b_param = 0
learning_rate = .000001
b_learning_rate = learning_rate * 1000  # the 1000x hack for b
last_total_error = float('inf')
for i in xrange(10000):
    m_grad = 0
    b_grad = 0
    total_error = 0
    for x, y in enumerate(data):
        guess = m_param * x + b_param
        err = y - guess
        total_error += err ** 2
        m_grad += -(2./N) * x * err
        b_grad += -(2./N) * err
    # stop once the total error no longer changes
    if last_total_error == total_error and i > 20:
        break
    last_total_error = total_error
    m_param -= m_grad * learning_rate
    b_param -= b_grad * b_learning_rate
print 'params', m_param, b_param
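In case it helps, here's a quick diagnostic I can run (same data and gradient formulas as above, just the first pass with both params at zero). I'd expect it to show how much larger the raw m gradient is than the b gradient, since it gets multiplied by x values up to 999, but I'm not sure what to conclude from that:

N = 1000
data = [x * 3.14159 + 3.14159 * 2 for x in xrange(N)]
m_grad = 0
b_grad = 0
for x, y in enumerate(data):
    err = y  # first-pass guess is 0 because m_param = b_param = 0
    m_grad += -(2./N) * x * err
    b_grad += -(2./N) * err
print 'first-pass gradients:', m_grad, b_grad
print 'ratio m_grad / b_grad:', m_grad / b_grad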