
I am trying to implement the SVM loss function and its gradient. I found some example projects that implement both, but I could not figure out how the loss function is used when computing the gradient.

Here is the formula of the loss function:

    Li = sum over j != yi of max(0, w(j).T * xi - w(yi).T * xi + delta)

What I cannot understand is how the loss function's result is used when computing the gradient.

The example project computes the gradient as follows:

    for i in range(num_train):
        scores = X[i].dot(W)                 # class scores for example i
        correct_class_score = scores[y[i]]
        for j in range(num_classes):
            if j == y[i]:
                continue
            margin = scores[j] - correct_class_score + 1  # note delta = 1
            if margin > 0:
                loss += margin
                dW[:, j] += X[i]      # gradient w.r.t. the j-th weight column
                dW[:, y[i]] -= X[i]   # gradient w.r.t. the correct class column

dW holds the gradient result, and X is the array of training data. But I don't understand how the derivative of the loss function leads to this code.
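For reference, here is a minimal self-contained version of that loop; the shapes and random data are invented just so it runs:

    import numpy as np

    np.random.seed(0)
    num_train, dim, num_classes = 5, 4, 3
    X = np.random.randn(num_train, dim)                 # one training example per row
    y = np.random.randint(num_classes, size=num_train)  # correct class labels
    W = np.random.randn(dim, num_classes)               # weight matrix

    loss = 0.0
    dW = np.zeros_like(W)
    for i in range(num_train):
        scores = X[i].dot(W)
        correct_class_score = scores[y[i]]
        for j in range(num_classes):
            if j == y[i]:
                continue
            margin = scores[j] - correct_class_score + 1  # note delta = 1
            if margin > 0:
                loss += margin
                dW[:, j] += X[i]
                dW[:, y[i]] -= X[i]

    print(loss, dW.shape)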

Hao Tan
Merve Bozo

3 Answers


The method used to calculate the gradient here is calculus (analytically, NOT numerically!). Differentiating the loss function with respect to w(yi) gives:

    dLi/dw(yi) = -( sum over j != yi of 1(w(j).T * xi - w(yi).T * xi + delta > 0) ) * xi

and with respect to w(j), for j != yi:

    dLi/dw(j) = 1(w(j).T * xi - w(yi).T * xi + delta > 0) * xi

The 1(...) is just the indicator function: a term contributes only when the condition inside it is true, i.e. when the margin is positive. Written as code, that is exactly the example you provided.
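To convince yourself the analytic gradient matches the code, you can compare it against a numerical (finite-difference) gradient. This is a sketch with small made-up data; `svm_loss_grad` is just my own wrapper around the loop from the question:

    import numpy as np

    def svm_loss_grad(W, X, y, delta=1.0):
        """Hinge loss and analytic gradient, summed over all examples."""
        loss, dW = 0.0, np.zeros_like(W)
        for i in range(X.shape[0]):
            scores = X[i].dot(W)
            for j in range(W.shape[1]):
                if j == y[i]:
                    continue
                margin = scores[j] - scores[y[i]] + delta
                if margin > 0:
                    loss += margin
                    dW[:, j] += X[i]      # indicator is true: +xi for w(j)
                    dW[:, y[i]] -= X[i]   # and -xi accumulated into w(yi)
        return loss, dW

    # Numerical check: (f(W+h) - f(W-h)) / 2h should match dW entrywise.
    np.random.seed(1)
    X = np.random.randn(6, 4)
    y = np.random.randint(3, size=6)
    W = np.random.randn(4, 3)
    _, dW = svm_loss_grad(W, X, y)

    h = 1e-5
    num = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += h
        Wm[idx] -= h
        num[idx] = (svm_loss_grad(Wp, X, y)[0] - svm_loss_grad(Wm, X, y)[0]) / (2 * h)

    print(np.max(np.abs(num - dW)))  # small, unless some margin sits exactly at 0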

Since you are using the cs231n example, you should definitely check the course notes, and the videos if needed.

Hope this helps!

dexhunter

  • How did they develop these formulas from the basic SVM loss? Can you please explain in more detail? Thanks – Uri Abramson Mar 29 '17 at 08:12
  • @UriAbramson Hi! This is actually basic calculus. Differentiating (w(j).T * xi - w(yi).T * xi + delta) with respect to w(yi) gives -xi, and differentiating with respect to w(j) gives xi (when the indicator function is true, in both cases). Since the website doesn't support equation rendering, it is better to check [the original note](http://cs231n.github.io/optimization-1/), and if you have trouble with the calculus, I recommend watching Khan Academy; they have great tutorial videos. I hope this helps. – dexhunter Mar 29 '17 at 12:14
  • I understand it now. I hadn't realized that the 1(... > 0) is a condition. Thanks for the explanation. Can you please explain why you need to take two derivatives, one w.r.t. Wj and the other w.r.t. Wyi? How does that work? – Uri Abramson Mar 29 '17 at 19:52
  • Wj and Wyi are different weight vectors; Wyi is the ideal weight vector and Wj, I assume, is the one we are trying to form – LoveMeow Jan 26 '18 at 16:05
  • How come there is a summation when the gradient is with respect to Wyi but no summation when it is with respect to Wj? How does the summation just disappear? – user2076774 Aug 12 '18 at 17:29
  • I struggled to understand this. Fortunately, this one came to the rescue: https://mlxai.github.io/2017/01/06/vectorized-implementation-of-svm-loss-and-gradient-update.html – tandem May 06 '20 at 14:05
  • Shouldn't there be a summation term in both? – r4bb1t Jul 16 '20 at 23:02

If the subtraction is less than or equal to zero, that loss term is zero, so its gradient with respect to W is also zero. If the subtraction is greater than zero, the gradient with respect to W is the partial derivative of that loss term.
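A tiny numeric illustration of that case split (the numbers are invented):

    import numpy as np

    x = np.array([1.0, 2.0])

    # Case 1: margin <= 0 -> max(0, margin) = 0, so no gradient contribution.
    margin = -0.5
    grad_wj = x if margin > 0 else np.zeros_like(x)
    print(grad_wj)  # [0. 0.]

    # Case 2: margin > 0 -> the loss term is linear in w(j).T * x, so d/dw(j) = x.
    margin = 0.7
    grad_wj = x if margin > 0 else np.zeros_like(x)
    print(grad_wj)  # [1. 2.]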

BoscoTsang

If we drop these two lines of code:

    dW[:,j] += X[i]
    dW[:,y[i]] -= X[i]

the loop computes only the loss value, not the gradient.
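A runnable sketch of that loss-only loop (shapes and data invented for illustration):

    import numpy as np

    np.random.seed(2)
    X = np.random.randn(5, 4)
    y = np.random.randint(3, size=5)
    W = np.random.randn(4, 3)

    loss = 0.0
    for i in range(X.shape[0]):
        scores = X[i].dot(W)
        for j in range(W.shape[1]):
            if j == y[i]:
                continue
            margin = scores[j] - scores[y[i]] + 1
            if margin > 0:
                loss += margin  # accumulate the loss only; no dW updates
    print(loss)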

Tomerikoo