
I have a fairly simple neural network that takes a flattened 6x6 grid as input and should output the values of four actions to take on that grid, i.e. a 1x4 tensor of values.

Sometimes, though, after a few runs I am getting a 1x4 tensor of NaN for no apparent reason:

tensor([[nan, nan, nan, nan]], grad_fn=<ReluBackward0>)

My model looks like this, with input_dim being 36 and output_dim being 4:

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class Model(nn.Module):
    def __init__(self, input_dim, output_dim):
        # super() initializes the parent nn.Module
        super(Model, self).__init__()
        # grid size as input,
        # last layer needs 4 outputs because of 4 possible actions: left, right, up, down
        # the outputs are Q-values; an action is later picked from them via argmax
        self.lin1 = nn.Linear(input_dim, 24)
        self.lin2 = nn.Linear(24, 24)
        self.lin3 = nn.Linear(24, output_dim)

    # function to feed the input through the net
    def forward(self, x):
        # accept numpy arrays as well as tensors
        if isinstance(x, np.ndarray):
            x = torch.tensor(x, dtype=torch.float)

        # rectified linear as activation function for the first two layers
        activation1 = F.relu(self.lin1(x))
        activation2 = F.relu(self.lin2(activation1))
        output = F.relu(self.lin3(activation2))

        return output

The input was:

tensor([[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.3333,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.3333, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.6667]])

What are possible causes for a NaN output, and how can I fix them?

J.Tmr
  • Your input is not normalized and you are using just ReLU activations. That could cause high values. Do you know what the highest value is that could occur in your input? If yes, divide every input sample by that number first. – Theodor Peifer Mar 14 '21 at 15:26
  • Thanks for the heads-up. I tried it with normalized input and still have the same issue sadly. – J.Tmr Mar 14 '21 at 16:33
  • See [this thread](https://stackoverflow.com/questions/33962226/common-causes-of-nans-during-training) about NaNs during training. – Shir Mar 14 '21 at 16:35
  • @Shir Thank you very much, that thread pointed me in the right direction. My loss function was using a standard deviation, and PyTorch's .std() function returns NaN for a single value (a short illustration follows below these comments). – J.Tmr Mar 18 '21 at 11:01
  • Great job finding the problem! I bookmarked this thread and come back to it every time I have a NaN problem. – Shir Mar 18 '21 at 11:24
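
For reference, a minimal sketch of the .std() behaviour described in the last comments; the actual loss code is not shown in the question, so the tensors below are purely illustrative:

import torch

# torch.std() uses Bessel's correction (divides by n - 1) by default,
# so a tensor with a single element produces NaN
single = torch.tensor([0.5])
print(single.std())                # tensor(nan)

# the biased estimator is defined even for a single element
print(single.std(unbiased=False))  # tensor(0.)

# with two or more values the default behaves as expected
print(torch.tensor([0.1, 0.4, 0.7]).std())

One possible guard is to only take the standard deviation when more than one value is present, or to pass unbiased=False.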

1 Answer


NaN values as outputs simply mean that training has become unstable, which can have almost any cause, including all kinds of bugs in the code. If you think your code is correct, you can try addressing the instability by lowering the learning rate or using gradient clipping.
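
As a rough sketch of the gradient-clipping suggestion (the question does not show the training loop, so the model, optimizer, loss and data below are placeholders, not the asker's code):

import torch
import torch.nn as nn

model = nn.Linear(36, 4)                                   # stand-in for the Model above
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)   # try a lower learning rate

x = torch.rand(1, 36)
target = torch.zeros(1, 4)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), target)
loss.backward()
# clip the global gradient norm before the optimizer step to limit the update size
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()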

Chris Holland