PyTorch CE loss evaluating to nan

Question

I am training a U-Net with 3 channels on input volumes of 40x40x40 voxels, using ReLU activation and cross-entropy loss (`nn.CrossEntropyLoss`). Unlike the original U-Net, there is only one convolution layer. For many iterations the test loss evaluates to nan, while the train loss always gives a defined value. What is the probable reason, and how can this be fixed?

I tried changing the activation function and adding a batch normalization layer after the convolution, but neither helped.

Epoch: 35 | train_loss: 0.3518 | train_acc: 0.9167 | test_loss: nan | test_acc: 0.9061
Epoch: 36 | train_loss: 0.2981 | train_acc: 0.9230 | test_loss: nan | test_acc: 0.9112
Epoch: 37 | train_loss: 0.2415 | train_acc: 0.9392 | test_loss: 0.3065 | test_acc: 0.9188
— Pranjali Singh, Jul 31 '23 at 16:08
And none of the test predictions are `nan`? Can you show the relevant code? — dan1st, Jul 31 '23 at 16:16
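
One way to answer this is to count non-finite values in the model output before the loss is computed. The following is a minimal sketch, not code from the post: `report_non_finite` is a hypothetical helper, and the `logits` tensor is a stand-in with illustrative shape and values.

```python
import torch


def report_non_finite(name: str, tensor: torch.Tensor) -> int:
    """Print and return the number of NaN or +/-inf entries in `tensor`."""
    bad = int((~torch.isfinite(tensor)).sum().item())
    if bad:
        print(f"{name}: {bad} non-finite values out of {tensor.numel()}")
    return bad


# Stand-in for a batch of logits shaped (N, C, D, H, W); values are illustrative.
logits = torch.randn(2, 3, 4, 4, 4)
logits[0, 1, 0, 0, 0] = float("nan")  # inject one NaN to show the check firing
n_bad = report_non_finite("prediction", logits)
```

In the evaluation loop, the same call could be placed right after `prediction = model(Input.float())`. If it reports nothing for a batch whose loss is nan, the predictions themselves are probably not the source.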

score 0 · Answer 1 · answered Aug 01 '23 at 14:34

Code for the loss evaluation part

import torch
from torch import nn

# model, dataloader and f1_loss are defined elsewhere in the script (not shown).
model.eval()
test_loss, test_acc, b1, b2, s1, s2 = 0, 0, 0, 0, 0, 0
# Label 2 is excluded from the loss; the mean is taken over the remaining voxels.
loss_fn: torch.nn.Module = nn.CrossEntropyLoss(ignore_index=2, reduction='mean')

with torch.inference_mode():
    # Loop through DataLoader batches
    for batch, input_dataset in enumerate(dataloader):
        Input = input_dataset[0]
        Target = input_dataset[1]

        prediction = model(Input.float())
        loss = loss_fn(prediction, Target.long())
        print("Loss", loss)
        test_loss += loss.item()

        # Count ground-truth voxels of class 1 (s1) and class 0 (b1)
        predicted_label = prediction.argmax(dim=1)
        s1 += (Target == 1).sum().item()
        b1 += (Target == 0).sum().item()

        acc, sb = f1_loss(predicted_label, Target)
        s2 += sb[1]
        b2 += sb[0]

        test_acc += acc

# Average the summed batch losses over the number of batches
test_loss = test_loss / len(dataloader)
print("DL", len(dataloader))

All output from the main function:

Loss tensor(nan)
Loss tensor(10.8893)
Loss tensor(12.2123)
DL 3
Epoch: 2 | train_loss: 111.9303 | train_acc: 0.9415 | test_loss: nan | test_acc: 0.8643
Loss tensor(nan)
Loss tensor(39.5226)
Loss tensor(17.3606)
DL 3
Epoch: 3 | train_loss: 25.7117 | train_acc: 0.7250 | test_loss: nan | test_acc: 0.9513
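
One possible cause that fits this output (an assumption, since the post does not confirm it): with `ignore_index=2` and `reduction='mean'`, the loss is averaged over the non-ignored voxels only. If a test batch contains only label 2, there is nothing to average over, and the mean can evaluate to nan even when every prediction is finite. The sketch below sidesteps this by accumulating the summed loss and the count of valid voxels separately. `summed_loss_and_count` is a hypothetical helper, and the toy tensors only illustrate the shapes.

```python
import torch
import torch.nn.functional as F

IGNORE = 2  # same ignore_index as in the post


def summed_loss_and_count(logits, target, ignore_index=IGNORE):
    """Return (sum of per-voxel CE losses, number of non-ignored voxels)."""
    loss_sum = F.cross_entropy(
        logits, target, ignore_index=ignore_index, reduction="sum"
    )
    n_valid = (target != ignore_index).sum()
    return loss_sum.item(), int(n_valid.item())


# Toy data (illustrative shapes): 1 sample, 3 classes, 2x2x2 voxels.
torch.manual_seed(0)
logits = torch.randn(1, 3, 2, 2, 2)
all_ignored = torch.full((1, 2, 2, 2), IGNORE, dtype=torch.long)
mixed = all_ignored.clone()
mixed[0, 0, 0, 0] = 1  # a single labelled voxel

# For comparison: mean reduction on the batch with no valid voxel.
print(F.cross_entropy(logits, all_ignored, ignore_index=IGNORE, reduction="mean"))

# Accumulate sums and counts over batches, then divide once at the end.
total, count = 0.0, 0
for tgt in (all_ignored, mixed):
    batch_sum, batch_valid = summed_loss_and_count(logits, tgt)
    total += batch_sum
    count += batch_valid

test_loss = total / count if count > 0 else float("nan")
print(test_loss)
```

Note that this gives a per-voxel average over the whole test set, which is not the same number as the mean of per-batch means used in the original loop. To check whether this is really what happens, print `(Target != 2).sum()` for the batch whose loss is nan.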

