All of your weights are updated.
You have y = Softmax(W x + β), so predicting y from a single x uses every entry of W. Anything used in the forward pass (prediction) receives a gradient in the backward pass, so it gets adjusted by the SGD step. Perhaps a more intuitive way of thinking about it is that you are predicting class membership probabilities for your features, and those probabilities must sum to 1. Assigning more probability to one class means taking it away from the others, so the weights feeding every class are updated.
Take for instance the simple case of x ∈ ℝ, y ∈ ℝ³. Then W ∈ ℝ^(1×3). Before activation, your prediction for a given x looks like: y = [y₁ = W₁₁x + β₁, y₂ = W₁₂x + β₂, y₃ = W₁₃x + β₃]. Categorical cross-entropy gives you an error signal for each of these mini-predictions, and you must then compute its derivative with respect to each of the W and β terms.
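To make this concrete, here is a minimal NumPy sketch of one SGD step for that 1-input, 3-class case. The numbers (x, the initial W, the one-hot target, the learning rate) are made up for illustration, and the code relies on the standard result that the gradient of categorical cross-entropy with respect to the pre-activation values is p − target, where p is the softmax output.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - z.max())
    return e / e.sum()

# Illustrative values only (assumptions, not from any real dataset).
x = 2.0                              # single scalar feature
W = np.array([0.5, -0.3, 0.1])       # W11, W12, W13
b = np.zeros(3)                      # beta1, beta2, beta3
target = np.array([1.0, 0.0, 0.0])   # one-hot label: true class is the first

# Forward pass: pre-activations, then class probabilities.
p = softmax(W * x + b)

# Backward pass: error signal for each of the three mini-predictions.
grad_z = p - target
grad_W = grad_z * x                  # dL/dW1k = (p_k - target_k) * x
grad_b = grad_z                      # dL/dbeta_k = p_k - target_k

# One SGD step: every weight and bias moves.
lr = 0.1
W_new = W - lr * grad_W
b_new = b - lr * grad_b

print("grad_W:", grad_W)
print("grad_b:", grad_b)
```

The gradient for the true class is negative (its weight is pushed up) and the gradients for the other two classes are positive (their weights are pushed down), and the three error signals sum to zero. None of them is zero, which is why all of the weights change.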
I hope this is clear.