
In multi-class logistic regression, let's say we use softmax and cross-entropy. Does SGD on one training example update all the weights, or only the portion of the weights associated with the label? For example, if the label is one-hot [0,0,1], is the whole matrix W ∈ ℝ^{feature_dim × num_class} updated, or only the column W^{(3)} ∈ ℝ^{feature_dim × 1}?

Thanks

Ancalagon BerenLuthien

1 Answer


All of your weights are updated.

You have y = Softmax(Wx + β), so predicting y from a single x makes use of every entry of W. If something is used during the forward pass (prediction), it also gets updated during the backward pass (SGD). Perhaps a more intuitive way of thinking about it is that you are predicting the class membership probabilities for your features; since the probabilities must sum to 1, assigning more weight to one class means removing weight from the others, so all of them need to be updated.

Take for instance the simple case of x ∈ ℝ, y ∈ ℝ³. Then W ∈ ℝ^{1×3}. Before activation, your prediction for some given x would look like: y = [y₁, y₂, y₃] with y₁ = W₁₁x + β₁, y₂ = W₁₂x + β₂, y₃ = W₁₃x + β₃. You have an error signal for all three of these mini-predictions, coming out of the categorical cross-entropy, and you must compute the derivative of the loss with respect to every W and β term. Concretely, for softmax + cross-entropy the gradient with respect to the logits is (p − y), where p is the softmax output; since pⱼ > 0 for every class j, even the columns whose label entry is 0 receive a nonzero gradient.
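A minimal numpy sketch of the point above (the dimensions, feature_dim = 4 and num_class = 3, are made up for illustration): the gradient of the cross-entropy loss with respect to W is the outer product x(p − y)ᵀ, and every one of its columns comes out nonzero, not just the column of the true class.

```python
import numpy as np

# Hypothetical tiny example: feature_dim = 4, num_class = 3.
rng = np.random.default_rng(0)
x = rng.normal(size=4)            # a single training example
W = rng.normal(size=(4, 3))       # weight matrix
b = np.zeros(3)                   # bias (the β above)
y = np.array([0.0, 0.0, 1.0])     # one-hot label for class 3

# Forward pass: softmax probabilities (shifted by the max for stability).
logits = x @ W + b
p = np.exp(logits - logits.max())
p /= p.sum()

# Gradient of cross-entropy w.r.t. the logits is (p - y);
# w.r.t. W it is the outer product x (p - y)^T.
grad_W = np.outer(x, p - y)
grad_b = p - y

# Every column of grad_W is nonzero: for classes with label entry 0,
# p_j > 0, so SGD actively pushes their weights down.
print(np.all(np.any(grad_W != 0, axis=0)))
```

An SGD step `W -= lr * grad_W` therefore changes the whole matrix, which is exactly the claim in the answer.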

I hope this is clear.

KonstantinosKokos