This question relates to the neural machine translation tutorial shown here: Neural Machine Translation

self.W1 and self.W2 are initialized as dense layers of 10 units each, in lines 4 and 5 of the __init__ function of the BahdanauAttention class.

In the attached code image, I am not sure I understand the feed-forward network set up in lines 17 and 18, so I broke the formula down into its parts (see lines 23 and 24). query_with_time_axis is the input tensor to self.W1, and values is the input to self.W2. Each computes Z = WX + b, and the two Z's are added together. The tensors being added have dimensions (64, 1, 10) and (64, 16, 10). I am assuming random weight initialization for both self.W1 and self.W2 is handled by Keras behind the scenes.
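For reference, here is a minimal sketch of the attention layer being discussed, following the TensorFlow NMT tutorial; my line numbering may not match the attached image exactly:

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)  # Dense(10) in the question
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query shape == (batch_size, hidden_size)
        # query_with_time_axis shape == (batch_size, 1, hidden_size)
        query_with_time_axis = tf.expand_dims(query, 1)

        # score shape == (batch_size, max_length, 1)
        score = self.V(tf.nn.tanh(
            self.W1(query_with_time_axis) + self.W2(values)))

        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after the sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights
```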
Question:
After adding the Z's together, a non-linearity (tanh) is applied to produce an activation, and this activation is fed into the next layer, self.V, which has a single output and gives us the score.

For this last step, no activation function (tanh, etc.) is applied to the result of self.V(tf.nn.tanh(self.W1(query_with_time_axis) + self.W2(values))) to get the single output of this last layer.
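To make the shapes concrete, here is a small standalone check of that last step, assuming the batch size 64, source length 16, and 10 attention units from the shapes above (the random tensors just stand in for the outputs of self.W1 and self.W2):

```python
import tensorflow as tf

# Stand-ins for self.W1(query_with_time_axis) and self.W2(values),
# using the shapes quoted above: batch 64, 16 time steps, 10 units.
z1 = tf.random.normal((64, 1, 10))
z2 = tf.random.normal((64, 16, 10))

# Broadcasting over the time axis gives shape (64, 16, 10).
activated = tf.nn.tanh(z1 + z2)

# self.V is Dense(1) with no activation, so the score is a raw
# (unbounded) value per time step, shape (64, 16, 1).
V = tf.keras.layers.Dense(1)
score = V(activated)
print(activated.shape, score.shape)  # (64, 16, 10) (64, 16, 1)
```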
Is there a reason why an activation function was not used for this last step?