
I'm a novice at deep learning. I have built some basic CNNs, but this time I'm trying to build an FCN (Fully Convolutional Network) similar to YOLOv3. My network contains 32 layers with LeakyReLU as the activation function and the Adam optimizer. There are 680 data samples, and the input image size is 416x416, the same as the YOLO model. I have included some snippets of my code below.

I'm using PyTorch 1.1 with CUDA 9. I have tried different learning rates such as 0.0001, 0.000001, and 0.0000001, as suggested in many blogs, and also different betas such as (0.9, 0.999), (0.5, 0.999), and so on. I have also tried training for longer, up to 200 epochs.
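
For reference, the optimizer is set up roughly like this (a simplified sketch; `model` stands for the network instance shown further below, and the lr/betas are just one of the combinations I tried):

    import torch.optim as optim

    # Sketch of the Adam setup; lr and betas are among the values mentioned above
    optimizer = optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))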

Output

    Running loss : 59102027776.0 0%| | 1/630 [00:19<3:24:03, 19.46s/it]
    Running loss : nan 0%| | 2/630 [00:23<2:34:23, 14.75s/it]
    Running loss : nan 0%| | 3/630 [00:25<1:53:32, 10.87s/it]
    Running loss : nan

Loss Function formula

Please consider only eq. 7 and 8 in the picture. Click the link for the formula picture.

Loss Function Code

    masked_pose_loss = torch.mean(
        torch.sum(mask * torch.sum(torch.mul(pred - true, pred - true), dim=[1, 2]), dim=[1, 2, 3]))
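
Following the suggestion in the comments, below is the same masked squared-error loss expanded into separate steps so each intermediate tensor can be checked for NaN. The shapes used here are an assumption (pred/true as (N, C, H, W) and mask broadcastable to (N, H, W)); they may not match my real layout exactly:

    import torch

    def masked_pose_loss_debug(pred, true, mask):
        # Assumed shapes: pred, true -> (N, C, H, W); mask -> (N, H, W) or broadcastable
        diff = pred - true                          # element-wise difference
        sq_err = diff * diff                        # squared error
        per_cell = torch.sum(sq_err, dim=1)         # sum over channels -> (N, H, W)
        masked = mask * per_cell                    # zero out cells without a target
        per_sample = torch.sum(masked, dim=[1, 2])  # sum over spatial dims -> (N,)
        loss = torch.mean(per_sample)               # mean over the batch

        # Report the first intermediate that contains NaN, to localize the problem
        for name, t in [("diff", diff), ("sq_err", sq_err), ("per_cell", per_cell),
                        ("masked", masked), ("per_sample", per_sample), ("loss", loss)]:
            if torch.isnan(t).any():
                print("NaN first appears in:", name)
                break
        return loss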

FCN

    self.relu1_1 = nn.LeakyReLU(inplace=True)
    self.conv1_2 = nn.Conv2d(32, 64, 3, stride=1,padding=1)
    self.relu1_2 = nn.LeakyReLU(inplace=True)
    self.pool1 = nn.MaxPool2d(2, stride=2, ceil_mode=True)

    # conv2

    self.conv2_1 = nn.Conv2d(64, 128, 3, stride=1, padding=1)
    self.relu2_1 = nn.LeakyReLU(inplace=True)
    self.conv2_2 = nn.Conv2d(128, 64, 1, stride=1)
    self.relu2_2 = nn.LeakyReLU(inplace=True)
    self.pool2 = nn.MaxPool2d(2, stride=2, ceil_mode=True)

    # conv3
    self.conv3_1 = nn.Conv2d(64, 128, 3, stride=1, padding=1)
    self.relu3_1 = nn.LeakyReLU(inplace=True)
    self.conv3_2 = nn.Conv2d(128, 256, 3, stride=1, padding=1)
    self.relu3_2 = nn.LeakyReLU(inplace=True)
    self.conv3_3 = nn.Conv2d(256, 128, 1, stride=1)
    self.relu3_3 = nn.LeakyReLU(inplace=True)
    self.pool3 = nn.MaxPool2d(2, stride=2, ceil_mode=True)

    # conv4
    self.conv4_1 = nn.Conv2d(128, 256, 3, stride=1, padding=1)
    self.relu4_1 = nn.LeakyReLU(inplace=True)
    self.conv4_2 = nn.Conv2d(256, 512, 3, stride=1, padding=1)
    self.relu4_2 = nn.LeakyReLU(inplace=True)
    self.conv4_3 = nn.Conv2d(512, 256, 1, stride=1)
    self.relu4_3 = nn.LeakyReLU(inplace=True)
    self.pool4 = nn.MaxPool2d(2, stride=2, ceil_mode=True)

    # conv5
    self.conv5_1 = nn.Conv2d(256, 512, 3, stride=1, padding=1)
    self.relu5_1 = nn.LeakyReLU(inplace=True)
    self.conv5_2 = nn.Conv2d(512, 256, 1, stride=1)
    self.relu5_2 = nn.LeakyReLU(inplace=True)
    self.conv5_3 = nn.Conv2d(256, 512, 3, stride=1, padding=1)
    self.relu5_3 = nn.LeakyReLU(inplace=True)
    self.pool5 = nn.MaxPool2d(2, stride=2, ceil_mode=True)

    # fc6

    self.conv6_1 = nn.Conv2d(512, 1024, 3, stride=1, padding=1)
    self.relu6_1 = nn.LeakyReLU(inplace=True)
    self.conv6_2 = nn.Conv2d(1024, 512, 1, stride=1)
    self.relu6_2 = nn.LeakyReLU(inplace=True)
    self.conv6_3 = nn.Conv2d(512, 1024, 3, stride=1, padding=1)
    self.relu6_3 = nn.LeakyReLU(inplace=True)
    self.conv7_1 = nn.Conv2d(1024, 512, 1, stride=1)
    self.relu7_1 = nn.LeakyReLU(inplace=True)
    self.conv7_2 = nn.Conv2d(512, 1024, 3, stride=1, padding=1)
    self.relu7_2 = nn.LeakyReLU(inplace=True)
    self.conv7_3 = nn.Conv2d(1024, 1024, 3, stride=1, padding=1)
    self.relu7_3 = nn.LeakyReLU(inplace=True)
    self.conv8_1 = nn.Conv2d(1024, 1024, 3, stride=1, padding=1)
    self.relu8_1 = nn.LeakyReLU(inplace=True)
    self.conv8_2 = nn.Conv2d(1024, 1024, 3, stride=1)
    self.relu8_2 = nn.ReLU(inplace=True)
    self.conv_rout16 = nn.Conv2d(512, 64, 1, stride=1)
    self.relu_rout16 = nn.ReLU(inplace=True)


    # Last two layers of the network:
    self.convf_1 = nn.Conv2d(1280, 1024, 3, stride=1, padding=1)
    self.reluf_1 = nn.LeakyReLU(inplace=True)

    self.convf_2 = nn.Conv2d(1024, self.target_channel_size, 1, stride=1)

    self.reluf_2 = nn.LeakyReLU(inplace=True)
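
As a quick sanity check, separate from the training loop, one can push a dummy 416x416 batch through the network and inspect the raw output before the loss is even computed. `FCN`, its constructor arguments, and the 3-channel input are assumptions here, since the forward pass and the full constructor are not shown above:

    import torch

    model = FCN()          # placeholder: constructor arguments omitted
    model.eval()
    with torch.no_grad():
        x = torch.randn(1, 3, 416, 416)   # dummy input, assuming 3-channel 416x416 images
        out = model(x)
        print("output shape:", out.shape)
        print("NaN in output:", torch.isnan(out).any().item())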

Your response is highly appreciated.

Note: this is the first time I'm posting. If I have missed any necessary information, please let me know.

  • Can you post the code of your loss function? It's likely that there's an error in there somewhere. Or better, post the whole training code so that people can follow what you're doing. – Florian Blume Aug 17 '19 at 12:30
  • Hi @FlorianBlume, thanks for your response. I have added the loss function code. – Muzamil Hussain Aug 18 '19 at 06:34
  • It's a bit difficult to read from the implementation of the loss function what you are trying to do, but your old explanation in words wasn't helping either. Could you give your loss as a formula again (maybe a LaTeX image or something, for readability) and explain in words what you are trying to achieve? Depending on the dimensions of the output of your network, `torch.mul` might cause the `NaN`. Try to expand the loss into multiple steps and see where the `NaN` occurs. – Florian Blume Aug 18 '19 at 06:47
  • Hi @FlorianBlume, thanks again. I have added the loss function picture above; kindly check it. This is my first post, so I might be missing important details, but I'm trying my best to explain it more clearly. – Muzamil Hussain Aug 18 '19 at 07:18
  • Did you normalize the input? – thefifthjack005 Aug 19 '19 at 04:30
  • Hi @thefifthjack005, yes, I did normalize it. – Muzamil Hussain Aug 19 '19 at 09:50
  • Are you scaling your network output to the same range as your true mask? – thefifthjack005 Aug 19 '19 at 11:26

0 Answers