
I want to train a Siamese LSTM such that the angular distance of two outputs is 1 (low similarity) if the corresponding label is 0, and 0 (high similarity) if the label is 1.

I took the formula for angular distance from here: https://en.wikipedia.org/wiki/Cosine_similarity
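
As a quick sanity check (plain numpy, outside the model), this is how I understand the formula: angular distance = arccos(cosine similarity) / pi, which maps identical directions to 0, orthogonal vectors to 0.5, and opposite directions to 1:

import numpy as np

def angular_distance(a, b):
    # cosine similarity of the two vectors, then mapped to [0, 1] via arccos / pi
    cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(cos_sim) / np.pi

print(angular_distance(np.array([1.0, 0.0]), np.array([1.0, 0.0])))   # 0.0  (same direction)
print(angular_distance(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # 0.5  (orthogonal)
print(angular_distance(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # 1.0  (opposite direction)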

This is my model code:

import math
import numpy as np
import tensorflow as tf

# inputs are unicode-encoded int arrays from strings
# similar strings should yield a low angular distance
left_input = tf.keras.layers.Input(shape=[None, 1], dtype='float32')
right_input = tf.keras.layers.Input(shape=[None, 1], dtype='float32')
lstm = tf.keras.layers.LSTM(10)
left_embedding = lstm(left_input)
right_embedding = lstm(right_input)
# cosine_layer is the operation to get cosine similarity
cosine_layer = tf.keras.layers.Dot(axes=1, normalize=True)
cosine_similarity = cosine_layer([left_embedding, right_embedding])
# next two lines calculate angular distance but with inversed labels
arccos = tf.math.acos(cosine_similarity)
angular_distance = arccos / math.pi # not 1. - (arccos / math.pi)
model = tf.keras.Model([left_input, right_input], [angular_distance])
model.compile(loss='binary_crossentropy', optimizer='sgd')
print(model.summary())

The model summary looks fine to me, and when I test with fixed input values I also get correct values for the cosine similarity etc.:

Model: "model_37"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_95 (InputLayer)           [(None, None, 1)]    0                                            
__________________________________________________________________________________________________
input_96 (InputLayer)           [(None, None, 1)]    0                                            
__________________________________________________________________________________________________
lstm_47 (LSTM)                  (None, 10)           480         input_95[0][0]                   
                                                                 input_96[0][0]                   
__________________________________________________________________________________________________
dot_47 (Dot)                    (None, 1)            0           lstm_47[0][0]                    
                                                                 lstm_47[1][0]                    
__________________________________________________________________________________________________
tf_op_layer_Acos_52 (TensorFlow [(None, 1)]          0           dot_47[0][0]                     
__________________________________________________________________________________________________
tf_op_layer_truediv_37 (TensorF [(None, 1)]          0           tf_op_layer_Acos_52[0][0]        
__________________________________________________________________________________________________
tf_op_layer_sub_20 (TensorFlowO [(None, 1)]          0           tf_op_layer_truediv_37[0][0]     
__________________________________________________________________________________________________
tf_op_layer_sub_21 (TensorFlowO [(None, 1)]          0           tf_op_layer_sub_20[0][0]         
__________________________________________________________________________________________________
tf_op_layer_Abs (TensorFlowOpLa [(None, 1)]          0           tf_op_layer_sub_21[0][0]         
==================================================================================================
Total params: 480
Trainable params: 480
Non-trainable params: 0
__________________________________________________________________________________________________
None

But upon training I always get a loss of NaN

model.fit([np.array(x_left_train), np.array(x_right_train)], np.array(y_train).reshape((-1,1)), batch_size=1, epochs=2, validation_split=0.1)

Train on 14400 samples, validate on 1600 samples
Epoch 1/2
  673/14400 [>.............................] - ETA: 5:42 - loss: nan

Is this not the correct way to get the similarity between two vectors and to train my network to produce those vectors?

Jonathan R

1 Answer


Binary cross entropy calculates log(output) and log(1 - output). This means that your output needs to be strictly greater than 0 and strictly less than 1, as otherwise you will calculate the log of a negative number, which results in NaN. (Note: log(0) gives you -inf, which is not as bad as NaN, but still not desirable.)

Mathematically, your output should be in the correct interval, but due to the inaccuracy of floating point operations, I can very well imagine that this is your problem. However, this is just a guess.
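
To illustrate with plain numpy (this is not your model, just the log behaviour that binary cross entropy relies on): an output that hits exactly 0 or 1, or drifts marginally outside [0, 1] through rounding, already produces -inf or NaN:

import numpy as np

# log(p) and log(1 - p) as used by binary cross entropy
for p in [0.0, 1.0, -1e-8, 1.0 + 1e-8]:
    print(p, np.log(p), np.log(1.0 - p))  # -inf for exact 0/1, nan once p leaves [0, 1]

# after clipping into [1e-6, 1 - 1e-6] both logs stay finite
p = np.clip(1.0 + 1e-8, 1e-6, 1.0 - 1e-6)
print(np.log(p), np.log(1.0 - p))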

So, try to force your output to be greater than 0 and less than 1, e.g. by clipping with a small epsilon:

angular_distance = tf.keras.backend.clip(angular_distance, 1e-6, 1 - 1e-6)
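
In the code from the question, that clip would go right after computing angular_distance and before building the Model; roughly like this (an untested sketch based on your snippet):

arccos = tf.math.acos(cosine_similarity)
angular_distance = arccos / math.pi
# keep the output strictly inside (0, 1) so binary_crossentropy never sees log(0) or log of a negative number
angular_distance = tf.keras.backend.clip(angular_distance, 1e-6, 1 - 1e-6)
model = tf.keras.Model([left_input, right_input], [angular_distance])
model.compile(loss='binary_crossentropy', optimizer='sgd')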
sebrockm
  • I thought about that too; I had many outputs that were close to 1, but I did not think that this could be the reason, since e.g. for sigmoid, outputs close to 0 and 1 are possible too, right? – Jonathan R Oct 22 '19 at 15:40
  • @JonathanR close *above* 0 and close *below* 1 is no issue at all, right. If you are absolutely certain that your values are always in this range (`clip` is one way of achieving this certainty), then this is not your issue. – sebrockm Oct 22 '19 at 15:49
  • Do you think that MSE is a better fit here as a loss function? In my mind binary crossentropy was the obvious choice, but now I am not certain anymore – Jonathan R Oct 22 '19 at 18:39
  • @JonathanR for me, BCE is the obvious choice, too. But you can try MSE; maybe it helps you figure out where the NaNs come from – sebrockm Oct 22 '19 at 18:48