
My model works when I use torch.sigmoid. I tried to make the sigmoid steeper by creating a new sigmoid function:

def sigmoid(x):
    return 1 / (1 + torch.exp(-1e5*x))

But for some reason the gradient doesn't flow through it (I get NaN). Is there a problem in my function, or is there a way to simply make the PyTorch implementation steeper (like my function)?

Code example:

import torch

def sigmoid(x):
    return 1 / (1 + torch.exp(-1e5*x))

a = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.58, requires_grad=True)

c = sigmoid(a-b)
c.backward()
a.grad
>>> tensor(nan)
  • Not sure if that's correlated to your problem, but when I implement this function in `numpy` and call it with -0.58 (a-b), I get a result, but also a `RuntimeWarning: overflow encountered in exp`. Probably pytorch has a problem with such a big magnitude of the exponent and results in nan? – Alex G Apr 21 '21 at 22:11

2 Answers


You put a scale factor of 1e5 in your exponential. With a - b = -0.58, the argument to torch.exp is 5.8e4, and exp(5.8e4) is so unbelievably high that there is no hope of getting a meaningful result here. You are probably getting a NaN because you are trying to backpropagate through a computational graph which at some point evaluates to inf (and beyond!).

Anyway, to make the slope of a function steeper, remember that d/dx f(a·x) = a·f′(a·x), so you need to multiply its argument by a value greater than 1 (and not negative, or you will flip the sign of your derivative), but not one that huge! Try 10, for example; it also depends on the order of magnitude of the inputs you are going to feed into the function.
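For instance, a numerically stable way to get a steeper curve is to keep the built-in torch.sigmoid and simply scale its argument (the factor of 10 and the name steep_sigmoid below are only illustrative):

import torch

def steep_sigmoid(x, k=10.0):
    # Scaling the argument multiplies the slope by k, while torch.sigmoid
    # stays numerically stable (no hand-written exp that can overflow).
    return torch.sigmoid(k * x)

a = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.58, requires_grad=True)

c = steep_sigmoid(a - b)
c.backward()
print(a.grad)  # a small but finite, non-zero gradient instead of nan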

trialNerror

The issue seems to be that when the input to your sigmoid implementation is negative, the argument to torch.exp becomes very large, causing an overflow. Using torch.autograd.set_detect_anomaly(True) as suggested here, you can see the error:

RuntimeError: Function 'ExpBackward' returned nan values in its 0th output.
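For reference, a minimal way to reproduce this with anomaly detection enabled (using the function and tensors from the question) could look like:

import torch

torch.autograd.set_detect_anomaly(True)  # report which op produced the nan

def sigmoid(x):
    return 1 / (1 + torch.exp(-1e5*x))

a = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.58, requires_grad=True)

c = sigmoid(a - b)
c.backward()  # raises the RuntimeError above instead of silently producing nan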

If you really need to use the function you have defined, a possible workaround could be to put a conditional check on the argument (but I am not sure if it would be stable, so I cannot comment on its usefulness):

def sigmoid(x):
    # Note: this Python-level check only works for scalar (0-dim) tensors;
    # for batched inputs use torch.where or a mask instead (see the sketch below).
    if x >= 0:
        return 1./(1 + torch.exp(-1e5*x))
    else:
        # algebraically equivalent form whose exp argument stays non-positive
        return torch.exp(1e5*x) / (1 + torch.exp(1e5*x))

Here, the expression in the else branch is algebraically equivalent to the original function (multiply the numerator and denominator by torch.exp(1e5*x)). This ensures that the argument to torch.exp is always non-positive, so it can never overflow.
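If the input can be a full tensor rather than a scalar, the same idea can be written without a Python branch. Here is one possible sketch using torch.where (not part of the original answer; the helper name is made up):

import torch

def stable_steep_sigmoid(x, k=1e5):
    # Clamp so that the argument to torch.exp is never positive in either
    # branch, then select the algebraically equivalent form element-wise.
    pos = 1. / (1 + torch.exp(-k * torch.clamp(x, min=0)))
    e = torch.exp(k * torch.clamp(x, max=0))
    neg = e / (1 + e)
    return torch.where(x >= 0, pos, neg)

Clamping before the exp keeps both branches finite, which matters because torch.where would otherwise propagate nan gradients from the unselected branch.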

As noted by trialNerror, the scale factor is so large that, except for inputs extremely close to zero, your gradient will evaluate to zero: the actual slope is too small to be resolved by the floating-point type. So if you plan to use this in a network you will likely find it very difficult to learn anything, since the gradients will almost always be zero. It might be better to select a smaller factor, depending on your use case.
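As a quick illustration (a sketch that assumes the branching sigmoid defined above and the values from the question), the gradient is huge at exactly zero and underflows to zero almost immediately away from it:

x = torch.tensor(0.0, requires_grad=True)
sigmoid(x).backward()
print(x.grad)   # tensor(25000.) -- the slope 1e5/4 at exactly zero

y = torch.tensor(-0.58, requires_grad=True)  # a - b from the question
sigmoid(y).backward()
print(y.grad)   # tensor(0.) -- exp(-58000) underflows, so the slope is lost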

GoodDeeds
  • The branching idea is a nice way to prevent the NaN from happening, I had not thought of that. However the gradients will always be 0, because the difference between 0 and `exp(-1e5)` is far below the floating-point precision. So the NaN goes away but if one wants to backprop through it, the actual problem is really the 1e5 factor – trialNerror Apr 21 '21 at 22:41
  • @trialNerror The gradient will be zero very quickly as you move away from 0, but at and very close to zero, it is still non-zero. For example, at 0 it is 25000. I assumed that this was the intended behavior required by the OP since they wanted a very steep function, but I otherwise agree with your comment. – GoodDeeds Apr 21 '21 at 22:44
  • That makes sense! I had to change it to `1e3` just because with `1e5` the gradient was pretty much always zero so the network didn't learn. But your trick worked! Thanks : ) –  Apr 21 '21 at 23:19