
I want to train the model with FP32 and perform inference with FP16.

For other networks (e.g., ResNet), FP16 inference worked.

But for EDSR (super-resolution), FP16 inference did not work.

The differences I found are:

  1. ReLU with inplace=True in EDSR
  2. PixelShuffle in EDSR
  3. No batchnorm in EDSR

I am using CUDA 11.3, Python 3.8.12, PyTorch 1.12.1 and cuDNN 8.7.0. Are there any functions in a convolutional neural network that do not support FP16?

GPU: RTX A6000

My process is:

net_half = net.half()         # cast the FP32-trained weights to FP16
net_half.eval()
input_half = input.half()     # cast the input tensor to FP16

with torch.no_grad():
    output_half = net_half(input_half)

I checked that there are no NaNs in the model parameters or the input:

torch.stack([torch.isnan(p).any() for p in net_half.parameters()]).any()
torch.isnan(input_half).any()

Both give False.
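Since FP16 overflow produces inf before any NaN appears, the analogous check for inf values may also be worth running; a minimal sketch, reusing net_half and input_half from above:

torch.stack([torch.isinf(p).any() for p in net_half.parameters()]).any()
torch.isinf(input_half).any()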

I also checked the basic operations used in EDSR:

import torch
import torch.nn as nn

device = torch.device('cuda')
Ny, Nx = 256, 256  # placeholder input size; the actual values come from my data

x = torch.randn(1, 4, Ny // 2, Nx // 2)

test_block1 = nn.Sequential(
    nn.Conv2d(4, 64, kernel_size=3, padding=1),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=True),
    nn.ReLU(True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=True),
    nn.Conv2d(64, 64 * 4, kernel_size=3, padding=1, bias=True),
    nn.PixelShuffle(2),
    nn.ReLU(True),
    nn.Conv2d(64, 4, kernel_size=3, padding=1)
)

x = x.half().to(device)
test_block1 = test_block1.half().to(device)

with torch.no_grad():
    y = test_block1(x)

print(y)

This does not give any NaN values.

I don't know why, but with the weights from epoch 1 I got proper results, while the weights from epoch 4 give NaN values.
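To narrow down where the non-finite values first appear with the epoch-4 weights, one option is to attach forward hooks to every leaf module; a minimal sketch, assuming the same net_half and input_half as above:

import torch

def report_nonfinite(name):
    def hook(module, inputs, output):
        # print every leaf module whose output contains inf or NaN
        if torch.is_tensor(output) and not torch.isfinite(output).all():
            print(f"non-finite output in: {name} ({module.__class__.__name__})")
    return hook

handles = [m.register_forward_hook(report_nonfinite(n))
           for n, m in net_half.named_modules()
           if len(list(m.children())) == 0]

with torch.no_grad():
    net_half(input_half)

for h in handles:
    h.remove()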

  • Probably a calculation which generates a `nan`. Different network models do different things to achieve similar results. This is probably why the `nan`s show up only with EDSR. – Michael Ruth Dec 29 '22 at 05:52
  • @MichaelRuth Thank you for replying. But when the model parameters do not contain NaN values and the input has no NaNs, can a NaN value be generated by the convolution calculation? – SIwoo Lee Dec 29 '22 at 05:59
  • Certainly. `float('inf')/float('inf')` evaluates to `nan`, as does `numpy.sqrt(-1)`. There exist many other operations on operands not including `nan` which evaluate to `nan`. Many libraries return `nan`s for other, sometimes seemingly arbitrary reasons. – Michael Ruth Dec 29 '22 at 06:03
  • @MichaelRuth Thank you for replying. As far as I know, conv2d only involves multiplication and addition, so even if ReLU produces many zeros, I don't think it would produce any NaNs. My EDSR has Conv2d, ReLU, and PixelShuffle, so there seems to be no reason for a NaN value to appear. Are the arbitrary reasons you mentioned things like the cuDNN version or the learning rate during training? – SIwoo Lee Dec 29 '22 at 06:16
  • Which version of `pytorch` are you running this with? – Michael Ruth Dec 29 '22 at 06:29
  • @MichaelRuth I'm using PyTorch 1.12.1 with an RTX A6000 GPU. – SIwoo Lee Dec 29 '22 at 06:33
  • Bummer, there was a [bug report](https://github.com/pytorch/pytorch/issues/72594) for `nan`S with `no_grad()` which affected, at least, version 1.9.0. It appears it was fixed by version 1.10.2. – Michael Ruth Dec 29 '22 at 06:51
  • @MichaelRuth Unfortunately, no_grad() does not give errors in my case :( (the code is added above) – SIwoo Lee Dec 29 '22 at 07:08
  • @MichaelRuth I have checked that the results came out correctly at the first epoch. Hence, I expect that something happened to the weights during training. – SIwoo Lee Dec 29 '22 at 08:32
  • If overflow occurs in a step (which is easy with FP16), you'll get `inf` or `-inf`. A subsequent step could result in `nan`, because `inf - inf` is `nan`. – Warren Weckesser Dec 29 '22 at 08:57
  • @WarrenWeckesser You mean that the network is too deep for FP16? – SIwoo Lee Dec 30 '22 at 02:10
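To illustrate the overflow scenario described in the last two comments: any value above the FP16 maximum (about 65504) becomes inf when cast to half precision, and a subsequent inf - inf produces NaN. A minimal sketch:

import torch

a = torch.tensor([70000.0], dtype=torch.float16)  # exceeds the FP16 max (~65504), so it overflows to inf
print(a)      # tensor([inf], dtype=torch.float16)
print(a - a)  # tensor([nan], dtype=torch.float16), because inf - inf is nan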
