I have a simple LSTM network that looks roughly like this:
import tensorflow as tf  # TF 1.x
LSTMCell = tf.nn.rnn_cell.LSTMCell
MultiRNNCell = tf.nn.rnn_cell.MultiRNNCell

lstm_activation = tf.nn.relu
# Two stacked forward LSTM cells
cells_fw = [LSTMCell(num_units=100, activation=lstm_activation),
            LSTMCell(num_units=10, activation=lstm_activation)]
stacked_cells_fw = MultiRNNCell(cells_fw)

# embedding_layer and features['length'] come from my input pipeline
_, states = tf.nn.dynamic_rnn(cell=stacked_cells_fw,
                              inputs=embedding_layer,
                              sequence_length=features['length'],
                              dtype=tf.float32)
# Concatenate the final hidden state (h) of each layer
output_states = [s.h for s in states]
states = tf.concat(output_states, 1)
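For context, embedding_layer and features['length'] come from an embedding lookup over padded integer sequences. The snippet below only illustrates their shapes; the vocabulary size, embedding dimension, and max length are made-up numbers, not my real ones:

# Illustrative only: rough shape of the inputs fed into dynamic_rnn above
token_ids = tf.placeholder(tf.int32, [None, 50])        # [batch, max_time] padded token ids
lengths = tf.placeholder(tf.int32, [None])               # true length of each sequence
embedding_matrix = tf.get_variable('embeddings', [20000, 128])          # [vocab_size, embed_dim]
embedding_layer = tf.nn.embedding_lookup(embedding_matrix, token_ids)   # [batch, max_time, 128]
features = {'length': lengths}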
My question is: when I don't use an activation (activation=None) or use tanh, everything works, but as soon as I switch to relu I keep getting "NaN loss during training". Why is that? It's 100% reproducible.
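To make concrete what I suspect might be going on, here is a toy numpy sketch (not my actual model; the weight scale and sizes are invented): repeatedly applying an unbounded activation to a recurrent state can blow it up to inf, after which the arithmetic produces NaN, whereas tanh keeps the state bounded.

import numpy as np

rng = np.random.RandomState(0)
w = (rng.randn(20, 20) * 1.5).astype(np.float32)    # recurrent weights with amplification > 1
relu = lambda x: np.maximum(x, 0.0)

state = np.ones((1, 20), dtype=np.float32)
for _ in range(200):                                 # unrolled "timesteps"
    state = relu(state @ w)
print(np.max(state))                                 # explodes to inf, then NaN

state = np.ones((1, 20), dtype=np.float32)
for _ in range(200):
    state = np.tanh(state @ w)                       # tanh squashes the state into [-1, 1]
print(np.max(state))                                 # stays bounded

Is something like this happening inside the LSTM when I use relu, or is there another explanation?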