
I am trying to implement a simple mixture density network (MDN) that predicts the parameters of a distribution over a target variable instead of a point value, and then assigns probabilities to discrete bins of that value. Narrowing the issue down, the code in which the `None` gradients appear is:

import numpy as np
import torch

# params: 400 bin edges (0..399), shaped (1, 1, 400) for broadcasting
tte_bins = np.linspace(
    start=0, 
    stop=399, 
    num=400, 
    dtype='float32'
).reshape(1, 1, -1)
bins = torch.tensor(tte_bins, dtype=torch.float32)
x_train = np.random.randn(1, 1024, 3)
y_labels = np.random.randint(low=0, high=399, size=(1, 1024))
y_train = np.eye(400)[y_labels]

# data
in_train = torch.tensor(x_train[0:1, :, :], dtype=torch.float)
in_train = (in_train - torch.mean(in_train)) / torch.std(in_train)
out_train = torch.tensor(y_train[0:1, :, :], dtype=torch.float)

# model
linear = torch.nn.Linear(in_features=3, out_features=2)
lin = linear(in_train)
preds = torch.exp(lin)

# intermediate values
alpha = torch.clamp(preds[0:1, :, 0:1], 0, 500)
beta = torch.clamp(preds[0:1, :, 1:2], 0, 100)

# probs
p1 = torch.exp(-torch.pow(bins / alpha, beta))
p2 = torch.exp(-torch.pow((bins + 1.0) / alpha, beta))
probs = p1 - p2

# loss
loss = torch.mean(torch.pow(out_train - probs, 2))

# gradients
loss.backward()
for p in linear.parameters():
    print(p.grad, 'gradient')

in_train has shape [1, 1024, 3], out_train has shape [1, 1024, 400], and bins has shape [1, 1, 400]. All the broadcasting appears fine; the resulting tensors (alpha, beta, loss) have the right shapes and values - there are simply no gradients.
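A quick sanity check (just a diagnostic sketch using the names above) shows that the graph is connected to the layer's parameters, so this is not a graph-construction problem:

# If the graph reaches the parameters, the intermediate tensors carry a grad_fn,
# the loss requires grad, and nn.Linear's parameters require grad by default.
print(lin.grad_fn is not None, probs.grad_fn is not None)   # True True
print(loss.requires_grad)                                   # True
print(all(p.requires_grad for p in linear.parameters()))    # True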

Edit: added loss.backward() and example x_train/y_train; now I get NaNs instead.
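To tell whether the NaNs are already present in the forward values or only appear during backward, a debugging sketch along these lines helps (anomaly detection needs a reasonably recent PyTorch, and the forward pass has to run inside the context manager):

# Are the NaNs already in the forward values?
print(torch.isnan(p1).any(), torch.isnan(p2).any(), torch.isnan(loss).any())

# Or do they only show up in the backward pass? detect_anomaly raises at the op
# whose backward produced the NaN, with a traceback of the corresponding forward op.
with torch.autograd.detect_anomaly():
    lin = linear(in_train)
    preds = torch.exp(lin)
    alpha = torch.clamp(preds[0:1, :, 0:1], 0, 500)
    beta = torch.clamp(preds[0:1, :, 1:2], 0, 100)
    p1 = torch.exp(-torch.pow(bins / alpha, beta))
    p2 = torch.exp(-torch.pow((bins + 1.0) / alpha, beta))
    loss = torch.mean(torch.pow(out_train - (p1 - p2), 2))
    loss.backward()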

user2780519
  • Can you add information about your input `x_train` and `y_train`? – McLawrence Aug 29 '18 at 03:06
  • Added example data; the NaNs appear to come from elsewhere. – user2780519 Aug 29 '18 at 03:21
  • You never use `y_labels`, and `test` is not defined. Your code should always be minimal and reproducible. – McLawrence Aug 29 '18 at 03:23
  • The gradients explode when you compute `p1` and `p2`. Using `preds.sum().backward()` still produces valid gradients. I do not know what you are trying to compute with your model. However, when computing the derivative of `p1` with respect to `alpha`, for example, you get a multiplicative factor of `bins**beta`, which will probably be very large. – McLawrence Aug 29 '18 at 03:27
  • Noted, fixed the `test`/`y_labels` issue. – user2780519 Aug 29 '18 at 03:27
  • As I said, I don't know about your model, but at least one problem is that after clamping, `alpha` is sometimes zero. In the derivative you divide by `alpha`, giving you `nan`s. – McLawrence Aug 29 '18 at 03:40
  • Changed the clamp minimum to an epsilon, and you are right - my gradients for `alpha` are huge. I'll have to think about how to rewrite this. – user2780519 Aug 29 '18 at 03:48
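For reference, the check suggested in the comments can be sketched like this (assuming the forward pass above has been run but backward() has not yet been called; retain_graph keeps the graph alive for the second backward):

# Gradients from a "safe" target are finite ...
linear.zero_grad()
preds.sum().backward(retain_graph=True)
print([p.grad.abs().max().item() for p in linear.parameters()])  # moderate values

# ... while, as discussed in the comments, the gradients through p1/p2 blow up or go nan.
linear.zero_grad()
loss.backward()
print([p.grad.abs().max().item() for p in linear.parameters()])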

1 Answer


You simply forgot to compute the gradients. You calculate the loss, but you never tell PyTorch to backpropagate it, so the parameters' `.grad` fields stay `None`.

Simply adding

loss.backward()

to your code should fix the problem.

Additionally, in your code some intermediate results such as alpha are sometimes zero but end up in a denominator when the gradient is computed. This leads to the nan values you observed.
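A minimal sketch of the clamp-to-epsilon change discussed in the comments (the 1e-6 floor is an arbitrary choice, not something prescribed here):

# Keep alpha (and beta) strictly positive so the division in the backward pass is defined.
eps = 1e-6
alpha = torch.clamp(preds[0:1, :, 0:1], min=eps, max=500)
beta = torch.clamp(preds[0:1, :, 1:2], min=eps, max=100)

As noted in the comments, the gradients can still be very large after this change, so the formulation of p1/p2 may need further rework.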

McLawrence