
I am trying to understand PyTorch's autograd in full, and I stumbled over this: let f(x) = x. From basic maths we know that f'(x) = 1, but when I do that exercise in PyTorch I get that f'(x) = x.

import torch

z = torch.linspace(-1, 1, steps=5, requires_grad=True)
y = z
y.backward(z)
print("Z tensor is: {} \n Gradient of y with respect to z is: {}".format(z, z.grad))

I would expect to get a tensor of size 5 full of ones, but instead I get:

Z tensor is: tensor([-1.0000, -0.5000,  0.0000,  0.5000,  1.0000], requires_grad=True) 
 Gradient of y with respect to z is: tensor([-1.0000, -0.5000,  0.0000,  0.5000,  1.0000])

Why does PyTorch behave this way?

Sergio

2 Answers


First of all, given z = torch.linspace(-1, 1, steps=5, requires_grad=True) and y = z, the function y = f(z) is vector-valued, so the derivative of y w.r.t. z is not simply 1 but a Jacobian matrix. In your case z = [z1, z2, z3, z4, z5]^T, where the superscript T means z is a column vector. Here is what the official doc says:

[screenshot from the official autograd docs: the definition of the Jacobian matrix J of a vector-valued function y = f(x) and the vector-Jacobian product computed by backward]
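
In standard notation (the usual textbook definition, not a verbatim copy of the doc's image), the Jacobian of y = f(z) and the quantity that backward(v) accumulates are:

\[
J = \begin{pmatrix}
\frac{\partial y_1}{\partial z_1} & \cdots & \frac{\partial y_1}{\partial z_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial z_1} & \cdots & \frac{\partial y_m}{\partial z_n}
\end{pmatrix},
\qquad
\texttt{z.grad} = v^{\top} J
\]

For y = z the Jacobian is the identity matrix, so v^T J = v^T; passing v = z therefore hands you back z itself, which is exactly the output you observed.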

Secondly, notice that the official doc says: "Now in this case y is no longer a scalar. torch.autograd could not compute the full Jacobian directly, but if we just want the vector-Jacobian product, simply pass the vector to backward as argument" (link). In that case z.grad is not the actual gradient value (the Jacobian matrix) but the vector-Jacobian product.
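
So if what you want is the elementwise derivative of y = z, pass a vector of ones: since the Jacobian of the identity is the identity matrix, the vector-Jacobian product with ones gives exactly the tensor of ones you expected. A minimal sketch:

import torch

z = torch.linspace(-1, 1, steps=5, requires_grad=True)
y = z
# v = ones, so v^T J = ones^T I = ones, i.e. the elementwise derivative dy/dz
y.backward(torch.ones_like(z))
print(z.grad)  # tensor([1., 1., 1., 1., 1.])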

EDIT: z.grad is the actual gradient if your output y is a scalar. See the example here:

z = torch.linspace(-1, 1, steps=5, requires_grad=True)
y = torch.sum(z)
y.backward()
z.grad 

This will output:

tensor([1., 1., 1., 1., 1.])

As you can see, it is the gradient. Notice the only difference is that y is a scalar value here, while it is a vector value in your example: grad can be implicitly created only for scalar outputs.
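
That last phrase is, in fact, the error you should see if you call backward() without an argument on a vector-valued output:

import torch

z = torch.linspace(-1, 1, steps=5, requires_grad=True)
y = z         # vector-valued output
y.backward()  # RuntimeError: grad can be implicitly created only for scalar outputs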

You might wonder what happens if the gradient is not constant but depends on the input z, as in this case:

z = torch.linspace(-1, 1, steps=5, requires_grad=True)
y = torch.sum(torch.pow(z,2))
y.backward()
z.grad

The output is:

tensor([-2., -1.,  0.,  1.,  2.])

It is the same as

z = torch.linspace(-1, 1, steps=5, requires_grad=True)
y = torch.sum(torch.pow(z,2))
y.backward(torch.tensor(1.))
z.grad
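
Equivalently, you can skip the sum and pass a vector of ones to backward on the vector-valued output; the vector-Jacobian product with ones again recovers the elementwise gradient. A small sketch along the same lines:

import torch

z = torch.linspace(-1, 1, steps=5, requires_grad=True)
y = torch.pow(z, 2)              # vector-valued output, Jacobian is diag(2*z)
y.backward(torch.ones_like(z))   # ones^T J = elementwise dy/dz = 2*z
print(z.grad)                    # tensor([-2., -1.,  0.,  1.,  2.])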

The blitz tutorial is rather brief, so it is actually quite hard for beginners to understand.

Danny Fang
  • Thanks! I had read the documentation before but I wasn't completely sure of what it meant. I think it is slightly misleading calling the vector-jacobian product just 'grad'. – Sergio Apr 10 '19 at 14:28

After discussing this with a colleague, we found that backward(z) actually multiplies the gradient evaluated at z by z itself, i.e. it computes the vector-Jacobian product with v = z. This makes sense for neural network applications, where the vector passed to backward is typically the gradient of a downstream scalar loss. A short code snippet to illustrate this:

z = torch.linspace(1, 5, steps=5, requires_grad=True)
y = torch.pow(z,2)
y.backward(z)
print("Z tensor is: {} \n Gradient of y with respect to z is: {}".format(z, z.grad/z))

The output is:

Z tensor is: tensor([1., 2., 3., 4., 5.], requires_grad=True) 
 Gradient of y with respect to z is: tensor([ 2.,  4.,  6.,  8., 10.], grad_fn=<DivBackward0>)

In this case, you can see that z.grad divided by z is the actual expected gradient of y with respect to z, which is 2*z.
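
In other words, what ends up in z.grad is the elementwise gradient 2*z multiplied (elementwise) by the vector z passed to backward. A quick sanity check along the same lines as the snippet above:

import torch

z = torch.linspace(1, 5, steps=5, requires_grad=True)
y = torch.pow(z, 2)
y.backward(z)  # vector-Jacobian product with v = z
print(z.grad)  # tensor([ 2.,  8., 18., 32., 50.])
print(torch.allclose(z.grad, 2 * z.detach() * z.detach()))  # True: z.grad == (2*z) * z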

Sergio