
To call backward on a non-scalar output in PyTorch, we can pass an optional vector argument, y.backward(v), which computes the product of v with the Jacobian matrix (a vector-Jacobian product):

import torch

x = torch.randn(3, requires_grad=True)
y = x * 2

# y is not a scalar, so backward needs a vector argument for the vector-Jacobian product
v = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(v)

print(x.grad)  # tensor([0.2000, 2.0000, 0.0002]), since dy_i/dx_i = 2

I think it costs the same to compute the full Jacobian matrix, because every node in the AD graph that is needed for the Jacobian is computed anyway. So why doesn't PyTorch just give us the Jacobian matrix?

LMD

1 Answer


When you call backward(), PyTorch updates the grad of each learnable parameter with the gradient of some loss function L w.r.t. that parameter. It has been designed with Gradient Descent (GD) and its variants in mind. Once the gradient has been computed, you can update each parameter with x = x - learning_rate * x.grad. The same graph nodes do get evaluated in the background, but the full Jacobian is not what one generally needs when applying GD optimization. The vector [0.1, 1.0, 0.0001] lets you reduce the output to a scalar, so that x.grad is a vector (and not a matrix, as it would be if you did not reduce), and hence GD is well defined. You can, however, obtain the Jacobian by calling backward repeatedly with one-hot vectors. For example, in this case:

import torch

x = torch.randn(3, requires_grad=True)
y = x * 2
J = torch.zeros(x.shape[0], x.shape[0])
for i in range(x.shape[0]):
    # one-hot vector selecting output y[i]
    v = torch.tensor([1 if j == i else 0 for j in range(x.shape[0])], dtype=torch.float)
    # retain_graph=True keeps the graph so backward can be called again in the next iteration
    y.backward(v, retain_graph=True)
    J[i, :] = x.grad   # row i of the Jacobian: dy[i]/dx
    x.grad.zero_()     # clear the accumulated gradient before the next pass
print(J)
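
As a side note (my own sketch, not part of the original answer), the "reduce to a scalar" point can be checked directly: calling y.backward(v) should give the same x.grad as backpropagating through the weighted sum (v * y).sum(), since both compute the same vector-Jacobian product.

import torch

x = torch.randn(3, requires_grad=True)
v = torch.tensor([0.1, 1.0, 0.0001])

# Path 1: pass v directly to backward
y = x * 2
y.backward(v)
grad_from_v = x.grad.clone()
x.grad.zero_()

# Path 2: reduce to a scalar first, then call backward with no argument
y = x * 2
(v * y).sum().backward()
grad_from_sum = x.grad.clone()

print(torch.allclose(grad_from_v, grad_from_sum))  # expected: True

(More recent PyTorch versions also expose torch.autograd.functional.jacobian, which materializes the full Jacobian for you, assuming your version ships it.)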
Gil Pinsky
  • Just a remark: ```x.grad``` is a vector in the reduced case instead of a matrix. – LMD Sep 09 '20 at 21:52
  • Another question: will ```retain_graph=True``` avoid most of the repeated computation? – LMD Sep 09 '20 at 21:57
  • Yes, you are correct. For your second question, it will save you from forward propagating repeatedly. retain_graph will save the graph representing the function that computes the gradient, which has been built once you have done forward propagation. Internally, each torch.Tensor is an entry point into that graph and has a ```grad_fn``` attribute which is used to compute the backprop, but it still needs to be evaluated once you call backward. – Gil Pinsky Sep 10 '20 at 18:39
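
For completeness, a small sketch (my own, not from the thread) of what retain_graph=True buys you: the graph built by one forward pass can be reused for several backward calls, while without it a later call fails because the graph has already been freed.

import torch

x = torch.randn(3, requires_grad=True)
y = x * 2

# retain_graph=True keeps the saved graph alive for another backward call
y.backward(torch.tensor([1.0, 0.0, 0.0]), retain_graph=True)

# Works: the graph from the single forward pass is reused;
# gradients accumulate in x.grad across calls.
y.backward(torch.tensor([0.0, 1.0, 0.0]))

# A third backward here would raise an error along the lines of
# "Trying to backward through the graph a second time", because the
# last call did not pass retain_graph=True and the graph was freed.
print(x.grad)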