
I have found a formula describing stochastic gradient descent (SGD):

θ = θ − η∇L(θ; x, y)

where θ is a parameter, η is the learning rate, and ∇L() is the gradient of the loss function. What I don't understand is how the parameter θ (which should contain the weights and biases) can be updated mathematically. Is there a mathematical interpretation of the parameter θ?

Thanks for any answers.

erip
tryg

1 Answer


That formula applies to both gradient descent and stochastic gradient descent (SGD). The difference between the two is that in SGD the loss is computed over a random subset of the training data (i.e. a mini-batch), as opposed to computing the loss over all the training data as in traditional gradient descent. So in SGD, x and y correspond to a subset of the training data and labels, whereas in gradient descent they correspond to all the training data and labels.
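To make the difference concrete, here is a minimal NumPy sketch. The toy data, the squared-error loss, and the batch size are all made up for illustration; only the update rule itself comes from the formula above.

```python
import numpy as np

# Toy data for a 1D linear model y ≈ w*x (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 0.1 * rng.normal(size=100)

def grad(theta, xb, yb):
    """Gradient of the mean squared error L = mean((theta*x - y)^2) w.r.t. theta."""
    return np.mean(2 * (theta * xb - yb) * xb)

theta = 0.0
eta = 0.1

# Gradient descent: the gradient is computed over ALL the data.
g_full = grad(theta, x, y)

# SGD: the gradient is computed over a random mini-batch.
idx = rng.choice(len(x), size=10, replace=False)
g_sgd = grad(theta, x[idx], y[idx])

# One SGD step: same formula, just a mini-batch gradient.
theta = theta - eta * g_sgd
```

Both variants plug into the exact same update formula; only the data used to evaluate ∇L differs.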

θ represents the parameters of the model. Mathematically this is usually modeled as a vector containing all the parameters of the model (all the weights, biases, etc.) arranged into a single vector. When you compute the gradient of the loss (a scalar) w.r.t. θ you get a vector containing the partial derivative of the loss w.r.t. each element of θ. So ∇L(θ;x,y) is just a vector, the same size as θ. If we were to assume that the loss were a linear function of θ, then this gradient points in the direction in parameter space that would result in the maximal increase in loss, with a magnitude that corresponds to the expected increase in loss if we took a step of size 1 in that direction. Since the loss isn't actually a linear function, and since we want to decrease the loss, we instead take a smaller step in the opposite direction, hence the η and the minus sign.
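A small sketch of this, using a made-up quadratic loss (chosen only because its gradient is easy to read off):

```python
import numpy as np

# theta packs all parameters (weights, biases, ...) into one flat vector.
theta = np.array([0.5, -1.0, 2.0])

# Illustrative loss L(theta) = ||theta - target||^2, not from the answer.
target = np.array([1.0, 0.0, 0.0])

def loss(theta):
    return np.sum((theta - target) ** 2)

def grad(theta):
    # Partial derivative of L w.r.t. each element of theta:
    # a vector with the same shape as theta.
    return 2 * (theta - target)

g = grad(theta)
eta = 0.1
# Small step in the OPPOSITE direction of the gradient.
theta_new = theta - eta * g
```

Here `g` has the same shape as `theta`, and stepping against it decreases the loss, which is exactly the behavior the update rule relies on.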

It's also worth pointing out that, mathematically, the form you've given is a bit problematic. We wouldn't usually write it like this, since assignment and equality aren't the same thing. The equation you provided would seem to imply that the θ on the left-hand and right-hand side of the equation are the same. They are not. The θ on the left side of the equals sign represents the value of the parameters after taking a step, and the θs on the right side correspond to the parameters before taking a step. We can be clearer by writing it with subscripts:

θ_{t+1} = θ_t − η∇L(θ_t; x, y)

where θ_{t} is the parameter vector at step t and θ_{t+1} is the parameter vector one step later.
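In code, this subscripted form corresponds to each step producing a new parameter vector rather than the two sides of an equation being equal. A minimal sketch, reusing an illustrative quadratic loss:

```python
import numpy as np

eta = 0.25
target = np.array([1.0, 0.0])   # minimizer of the toy loss ||theta - target||^2
history = [np.zeros(2)]         # history[t] is theta_t; theta_0 = 0

# theta_{t+1} = theta_t - eta * grad L(theta_t): each iteration yields a
# NEW vector theta_{t+1}; theta_t itself is never "equal to" theta_{t+1}.
for t in range(20):
    g = 2 * (history[t] - target)        # gradient at theta_t
    history.append(history[t] - eta * g) # theta_{t+1}
```

Keeping the whole sequence θ_0, θ_1, … in `history` makes the subscripted notation literal: the iterates form a trajectory that approaches the minimizer of the toy loss.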

jodag
    Good answer. Perhaps worth mentioning too that SGD is just one of many iterative methods for optimization of a linear system under some convex loss. The "difference" between them is typically some combination of choice of learning rate and the gradient direction at each step. – erip Jan 22 '22 at 21:31