
I’m trying to get d(loss)/d(input). I know I have 2 options.

First option:

    loss.backward()
    dlossdx = x.grad.data

Second option:

    # criterion = nn.CrossEntropyLoss(reduce=False)
    # loss = criterion(y_hat, labels)     
    # No need to call backward. 
    dlossdx = torch.autograd.grad(outputs = loss,
                                  inputs = x,
                                  grad_outputs = ? )

My question is: if I use cross-entropy loss, what should I pass as grad_outputs in the second option?

Do I put d(CE)/d(y_hat)? Since PyTorch's CrossEntropyLoss includes the softmax, this would require me to pre-calculate the softmax derivative using the Kronecker delta.

Or do I put d(CE)/d(CE), which is torch.ones_like?

A conceptual answer is fine.

aerin
  • `grad_outputs` can be useful if you care about higher-order derivative products (e.g., Hessian-vector products). For standard gradients, most folks generally do not call `autograd` methods. – ZaydH Feb 01 '21 at 12:43

1 Answer


Let's try to understand how both options work.

We will use this setup:

    import torch
    import torch.nn as nn
    import numpy as np

    x = torch.rand((64, 10), requires_grad=True)                  # input we want the gradient for
    net = nn.Sequential(nn.Linear(10, 10))                        # toy model
    labels = torch.tensor(np.random.choice(10, size=64)).long()   # random integer class labels
    criterion = nn.CrossEntropyLoss()                             # default reduction='mean' -> scalar loss

First option

    loss = criterion(net(x), labels)
    loss.backward(retain_graph=True)   # retain_graph=True so the same graph can be reused below
    dloss_dx = x.grad                  # d(loss)/d(x), same shape as x

Note that you pass no gradient argument to backward() because the loss is a scalar quantity. If you compute the loss as a vector instead, then you have to pass a gradient tensor of the same shape as the loss, as in the sketch below.
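
For illustration, a minimal sketch of that vector-loss case, assuming reduction='none' (the current replacement for the deprecated reduce=False in your commented-out code); the variable names here are made up:

    # Per-sample (vector) loss: backward() now needs an explicit gradient argument
    criterion_none = nn.CrossEntropyLoss(reduction='none')
    loss_vec = criterion_none(net(x), labels)          # shape: (64,)

    x.grad = None                                      # clear the gradient from the scalar case above
    loss_vec.backward(gradient=torch.ones_like(loss_vec))
    dloss_dx_vec = x.grad                              # gradient of loss_vec.sum() w.r.t. x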

Second option

    dloss_dx2 = torch.autograd.grad(loss, x)   # no grad_outputs needed: loss is a scalar

This will return a tuple and you can use the first element as the gradient of x.
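
For example, a quick sanity check that the two options agree (using the names from the snippets above):

    grad_x = dloss_dx2[0]                      # first (and only) element of the returned tuple

    # Should match the gradient accumulated by loss.backward() in the first option
    print(torch.allclose(dloss_dx, grad_x))    # expected: True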

Note that torch.autograd.grad returns the sum of d(out)/dx if you pass multiple outputs as a tuple. But since the loss is a scalar here, you don't need to pass grad_outputs: by default it is taken to be a tensor of ones, i.e. d(loss)/d(loss).
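
If you keep the unreduced loss from your commented-out code, then what you want as grad_outputs is torch.ones_like(loss) (your d(CE)/d(CE) guess), not a hand-derived softmax Jacobian; autograd differentiates through the softmax for you. A sketch under that assumption (names are made up):

    # Unreduced (per-sample) loss: pass ones as grad_outputs to get d(loss.sum())/dx
    criterion_none = nn.CrossEntropyLoss(reduction='none')
    loss_vec = criterion_none(net(x), labels)                     # shape: (64,)

    dloss_dx3 = torch.autograd.grad(outputs=loss_vec,
                                    inputs=x,
                                    grad_outputs=torch.ones_like(loss_vec))[0]

This gives the gradient of the summed per-sample losses; divide by the batch size if you want it to match the default reduction='mean' result from the scalar case.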

Umang Gupta