When you call backward(), PyTorch updates the .grad attribute
of each learnable parameter with the gradient of some loss function L
with respect to that parameter. Autograd was designed with gradient descent (GD) and its variants in mind: once the gradient has been computed, you can update each parameter with x = x - learning_rate * x.grad.
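For illustration, here is a minimal sketch of that manual update step (the toy loss, tensor shape, and learning rate are placeholder assumptions, not from the question):

import torch

# Placeholder parameter and learning rate, just to illustrate the update.
x = torch.randn(3, requires_grad=True)
learning_rate = 0.01

loss = (x ** 2).sum()        # some scalar loss L
loss.backward()              # fills x.grad with dL/dx

with torch.no_grad():        # the update itself should not be tracked by autograd
    x -= learning_rate * x.grad
x.grad.zero_()               # reset the gradient before the next iteration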
The Jacobian does get computed in the background, but it is generally not what one needs when applying GD optimization. The vector [0.1, 1.0, 0.0001]
lets you reduce the output to a scalar, so that x.grad is a vector (and not a matrix, as it would be if you did not reduce), and hence the GD update is well defined. You could, however, obtain the full Jacobian by calling backward with one-hot vectors. For example, in this case:
import torch

x = torch.randn(3, requires_grad=True)
y = x * 2

J = torch.zeros(x.shape[0], x.shape[0])
for i in range(x.shape[0]):
    # One-hot vector selecting the i-th output y[i].
    v = torch.tensor([1 if j == i else 0 for j in range(x.shape[0])], dtype=torch.float)
    # retain_graph=True keeps the graph alive for the next backward pass.
    y.backward(v, retain_graph=True)
    # x.grad now holds the gradient of y[i] w.r.t. x.
    J[:, i] = x.grad
    x.grad.zero_()
print(J)
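Since y = x * 2 acts element-wise, the printed J is just 2 * I, i.e. a 3x3 matrix with 2 on the diagonal. As a side note (assuming you are on a PyTorch version that ships torch.autograd.functional), the same Jacobian can be obtained in one call:

import torch

x = torch.randn(3, requires_grad=True)
# Computes the full Jacobian of the function at x, without the manual loop.
J = torch.autograd.functional.jacobian(lambda t: t * 2, x)
print(J)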