
When using torch.nn.BCELoss() on two arguments that are both results of some earlier computation, I get a curious error, which this question is about:

RuntimeError: the derivative for 'target' is not implemented

The MCVE is as follows:

import torch
import torch.nn.functional as F

net1 = torch.nn.Linear(1,1)
net2 = torch.nn.Linear(1,1)
loss_fcn = torch.nn.BCELoss()

x = torch.zeros((1,1))

y = F.sigmoid(net1(x)) #make sure y is in range (0,1)
z = F.sigmoid(net2(y)) #make sure z is in range (0,1)

loss = loss_fcn(z, y) #works if we replace y with y.detach()

loss.backward()

It turns out that if we call .detach() on y, the error disappears. But this results in a different computation: in the .backward() pass, the gradients with respect to the second argument of the BCELoss will no longer be computed.

Can anyone explain what I'm doing wrong in this case? As far as I know, all PyTorch modules in torch.nn should support computing gradients. And the error message seems to tell me that the derivative is not implemented for y, which is strange: we can compute the gradient of y, yet passing y fails while passing y.detach() works, which seems contradictory.

flawr
  • How about `loss = (loss_fcn(z, y.detach()) + loss_fcn(y, z.detach()))/2`? – Nagabhushan S N Aug 30 '21 at 18:16
  • @NagabhushanSN Note that this loss function is not symmetric with respect to the arguments, so this will not result in the desired loss. – flawr Aug 31 '21 at 07:49
  • you're right, it'll be different. But will it not serve your purpose? You need the network to update both `y` and `z` such that they come close to each other. The above loss function will achieve that right? – Nagabhushan S N Aug 31 '21 at 11:09
  • But then you could also use any other loss function, or alternatively just implement the BCE loss manually :) When I asked this question I was really just curious about the behaviour of the built-in `BCELoss()` function. Thanks for the suggestion though! – flawr Aug 31 '21 at 11:58
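The symmetrised loss suggested in the comments above can be sketched as follows (names mirror the question's MCVE; x is set to ones here so the weight gradients are non-trivial). Each term detaches its own target, so both networks still receive gradients:

```python
import torch

net1 = torch.nn.Linear(1, 1)
net2 = torch.nn.Linear(1, 1)
loss_fcn = torch.nn.BCELoss()

x = torch.ones((1, 1))
y = torch.sigmoid(net1(x))  # in (0, 1)
z = torch.sigmoid(net2(y))  # in (0, 1)

# each call detaches its own target, so gradients flow into each
# argument in turn (note: this is not the same value as plain BCE)
loss = (loss_fcn(z, y.detach()) + loss_fcn(y, z.detach())) / 2
loss.backward()

print(net1.weight.grad, net2.weight.grad)
```

As the comment thread notes, this is a different loss than BCE(z, y), but it does push y and z towards each other while updating both networks.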

2 Answers


It seems I misunderstood the error message: it is not y that doesn't allow the computation of gradients, it is BCELoss() that doesn't implement gradients with respect to its second argument. A similar problem was discussed here.
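As a sketch of the manual workaround mentioned in the comments (assuming only the BCE formula is needed, not the nn.Module itself), autograd then differentiates with respect to both arguments:

```python
import torch

net1 = torch.nn.Linear(1, 1)
net2 = torch.nn.Linear(1, 1)

x = torch.zeros((1, 1))
y = torch.sigmoid(net1(x))  # plays the role of the target
z = torch.sigmoid(net2(y))  # plays the role of the input

# hand-written BCE: unlike nn.BCELoss, this is differentiable
# with respect to both z and y
loss = -(y * torch.log(z) + (1 - y) * torch.log(1 - z)).mean()
loss.backward()

print(net1.weight.grad, net2.weight.grad)  # both are populated
```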

flawr

I met the same problem too. As far as I know, the second argument target of BCELoss(input, target) should be a tensor without a gradient attribute, i.e. target.requires_grad should be False. But I don't know why.

Usually the target (we could also call it the ground truth) doesn't carry a gradient. But here the target (y in your code) was computed by F.sigmoid(net1(x)), so the target (the output of net1) is a tensor with a gradient attribute.

So you could try:

loss = loss_fcn(z, y.detach())

or (the legacy spelling, which .detach() supersedes):

loss = loss_fcn(z, y.data)

Or maybe like this, if you also want to inspect the gradient that reaches y:

import torch
import torch.nn.functional as F

net1 = torch.nn.Linear(1,1)
net2 = torch.nn.Linear(1,1)
loss_fcn = torch.nn.BCELoss()

x = torch.zeros((1,1))

y = F.sigmoid(net1(x)) #make sure y is in range (0,1)
z = F.sigmoid(net2(y)) #make sure z is in range (0,1)

y.retain_grad() #keep y's gradient after backward (y is a non-leaf tensor)

loss = loss_fcn(z, y.detach()) #detached target: no error

loss.backward()

print(y.grad) #the gradient that reaches y, via z only
Nei Wu
  • That is what I already did (see question) but it does not solve the problem I asked about. If you do this, then the gradients with respect to the second argument will *not* be computed, and this is exactly what I was asking to avoid. – flawr Aug 08 '19 at 07:50
  • I think we are talking about different problems; my problem here is still not solved: if we call `a.detach()`, then `loss.backward()` will *not* compute any gradients of `loss` with respect to the parameters of `a`. (Note that the parameters of `net1` will still have gradients, but only via the path through the first argument `z` of our loss, not through the second argument.) – flawr Aug 08 '19 at 12:06
  • To illustrate what I mean: let `bce(x,y) = y * log(x) + (1-y) * log(1-x)`. Then the *actual* gradient would be `bce'(x,y) = [x' * (y/x - (1-y)/(1-x)), y' * (log(x) - log(1-x))]`. But if we use `y.detach()`, then the optimizer will only use `bce'(x) = x' * (y/x - (1-y)/(1-x))`, and the second entry will be completely ignored. That is the problem: I don't want the second entry to be ignored, I need it in the computation of my losses. – flawr Aug 08 '19 at 12:11
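The partial derivatives in the last comment can be checked directly with autograd. A standalone sketch using the comment's sign convention (which omits BCE's leading minus), with arbitrary example values:

```python
import torch

x = torch.tensor(0.7, requires_grad=True)
y = torch.tensor(0.3, requires_grad=True)

# the comment's bce(x, y) = y*log(x) + (1-y)*log(1-x)
bce = y * torch.log(x) + (1 - y) * torch.log(1 - x)
bce.backward()

# matches the comment's gradient entries
assert torch.isclose(x.grad, y / x - (1 - y) / (1 - x))
assert torch.isclose(y.grad, torch.log(x) - torch.log(1 - x))
```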