
I'm training a Keras model with a custom loss function, which I had already tested successfully. Recently I started training on a new dataset and got a strange result: the model trains fine, but val_loss comes out as NaN. Here is the loss:

from tensorflow.keras import backend as k
from tensorflow.keras.activations import relu
from tensorflow.keras.layers import Lambda, add

def Loss(y_true, y_pred):
    y_pred = relu(y_pred)
    z = k.maximum(y_true, y_pred)
    y_pred_negativo = Lambda(lambda x: -x)(y_pred)
    w = k.abs(add([y_true, y_pred_negativo]))
    if k.sum(z) == 0:
        error = 0
    elif k.sum(y_true) == 0 and k.sum(z) != 0:
        error = 100
    elif k.sum(y_true) == 0 and k.sum(z) == 0:
        error = 0
    else:
        error = (k.sum(w) / k.sum(z)) * 100
    return error

I have tried many things:

  1. Looked at the data for NaNs
  2. Normalization - on and off
  3. Clipping - on and off
  4. Dropouts - on and off

Someone suggested it could be a problem with the CUDA installation, but I'm not sure.

Any idea what the problem is, or how I can diagnose it?

Marlon Teixeira
  • What I think from what you wrote is that there's a sample in the new validation dataset that gives the strange result. What I would try is a test: use the validation dataset as the training dataset (with no validation) and see if the loss becomes NaN after a specific sample. – BestDogeStackoverflow Apr 09 '21 at 22:38
  • @BestDogeStackoverflow It is a real mystery. I have tried everything I could. I have to look at the y_pred and y_true values, because another loss function validates them without any problem. – Marlon Teixeira Apr 11 '21 at 14:27
  • @BestDogeStackoverflow I'm printing the error value from the loss function and I can see that many values are NaN. However, I still have to find out why... – Marlon Teixeira Apr 11 '21 at 15:38
  • @BestDogeStackoverflow I've found the problem. I'm using a plain Python conditional on a Keras tensor, so it never applies and a division by zero occurs. I have to rewrite it as a Keras conditional. – Marlon Teixeira Apr 11 '21 at 16:39
  • 1
    seems you got lucky with the first dataset :D, mark the question as answered when you can – BestDogeStackoverflow Apr 11 '21 at 17:13
  • @BestDogeStackoverflow First I have to rewrite this conditional in terms of Keras. Then I'll answer it. – Marlon Teixeira Apr 11 '21 at 17:58

1 Answer


The problem turned out to be division by zero, but the reason it was happening was a little tricky. As you can see, the definition above has conditionals that were supposed to prevent division by zero. However, they were written to handle NumPy objects, not the tensors that Keras actually passes to the loss function. Therefore, the guards never fired, and division by zero happened very often.

In order to fix it, I had to rewrite the loss in terms of Keras conditionals (remember to avoid mixing pure Keras with tf.keras), just as I've posted here. Any further comment is more than welcome!
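For anyone hitting the same issue, here is one way such a rewrite can look. This is only a sketch, assuming tf.keras, and is not necessarily the exact code the author posted: it uses K.switch so the branching is built into the graph instead of being decided once by a Python if, and tf.math.divide_no_nan so 0/0 yields 0 instead of NaN:

```python
import tensorflow as tf
from tensorflow.keras import backend as K

def graph_safe_loss(y_true, y_pred):
    y_pred = K.relu(y_pred)
    z = K.maximum(y_true, y_pred)
    w = K.abs(y_true - y_pred)
    sum_z = K.sum(z)
    sum_true = K.sum(y_true)
    # divide_no_nan returns 0 where the denominator is 0,
    # so the 0/0 case never produces a NaN
    ratio = tf.math.divide_no_nan(K.sum(w), sum_z) * 100.0
    # K.switch evaluates the condition per call inside the graph;
    # a Python `if` would only see a symbolic tensor and never branch
    return K.switch(K.equal(sum_z, 0.0),
                    K.zeros_like(ratio),
                    K.switch(K.equal(sum_true, 0.0),
                             100.0 * K.ones_like(ratio),
                             ratio))
```

The key design point is that every branch is an ordinary tensor expression, so the function stays differentiable and traceable whether Keras runs it eagerly or compiles it into a graph.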

Marlon Teixeira