This question relates to the neural machine translation tutorial shown here: Neural Machine Translation

self.W1 and self.W2 are initialized as dense layers of 10 units each, in lines 4 and 5 of the __init__ function of the BahdanauAttention class.

In the attached code image, I am not sure I understand the feed-forward network set up in lines 17 and 18, so I broke the formula down into its parts (see lines 23 and 24). query_with_time_axis is the input tensor to self.W1, and values is the input to self.W2. Each computes Z = WX + b, and the two Z's are added together. The tensors being added have dimensions (64, 1, 10) and (64, 16, 10). I am assuming random weight initialization for both self.W1 and self.W2 is handled by Keras behind the scenes.
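For reference, here is a minimal sketch of the attention layer being discussed, following the TensorFlow NMT tutorial; my line numbering may not match the attached image exactly:

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)  # Dense(10) in the question
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query shape == (batch_size, hidden_size)
        # query_with_time_axis shape == (batch_size, 1, hidden_size)
        query_with_time_axis = tf.expand_dims(query, 1)

        # score shape == (batch_size, max_length, 1)
        score = self.V(tf.nn.tanh(
            self.W1(query_with_time_axis) + self.W2(values)))

        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after the sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights
```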
Question:
After adding the Z's together, a non-linearity (tanh) is applied to produce an activation, and this activation is fed into the next layer, self.V, which has a single output and gives us the score.

For this last step, no activation function (tanh, etc.) is applied to the result of self.V(tf.nn.tanh(self.W1(query_with_time_axis) + self.W2(values))) to get the single output of this last layer.
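To make the shapes concrete, here is a small standalone check of that last step, assuming the batch size 64, source length 16, and 10 attention units from the shapes above (the random tensors just stand in for the outputs of self.W1 and self.W2):

```python
import tensorflow as tf

# Stand-ins for self.W1(query_with_time_axis) and self.W2(values),
# using the shapes quoted above: batch 64, 16 time steps, 10 units.
z1 = tf.random.normal((64, 1, 10))
z2 = tf.random.normal((64, 16, 10))

# Broadcasting over the time axis gives shape (64, 16, 10).
activated = tf.nn.tanh(z1 + z2)

# self.V is Dense(1) with no activation, so the score is a raw
# (unbounded) value per time step, shape (64, 16, 1).
V = tf.keras.layers.Dense(1)
score = V(activated)
print(activated.shape, score.shape)  # (64, 16, 10) (64, 16, 1)
```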
Is there a reason why an activation function was not used for this last step?