
I was reading and coding for a Machine Translation task and stumbled across two different tutorials.

One of them is an implementation of the caption generation with visual attention paper, where image features of shape [64, 2048] are used as if each image were a sentence of 64 words, with each word having an embedding of length 2048. I totally get that implementation, and here is the code for Bahdanau's additive-style attention:

class BahdanauAttention(tf.keras.Model):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, features, hidden):
    # features: encoder output, shape (batch, 64, 2048)
    # hidden: previous decoder state, shape (batch, hidden_size)
    # hidden_with_time_axis: shape (batch, 1, hidden_size)
    hidden_with_time_axis = tf.expand_dims(hidden, 1)

    # attention_hidden_layer: shape (batch, 64, units)
    attention_hidden_layer = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))

    # score: shape (batch, 64, 1)
    score = self.V(attention_hidden_layer)

    # attention_weights: shape (batch, 64, 1), softmax over the 64 feature locations
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector: shape (batch, 2048) after the weighted sum over locations
    context_vector = attention_weights * features
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights
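
For reference, here is a minimal usage sketch of how I understand this layer is driven (the batch size, `units` value, and hidden size are my assumptions, not from the tutorial):

import tensorflow as tf

attention = BahdanauAttention(units=512)

# features: CNN encoder output treated as 64 "words" of length 2048 (assumed batch of 16)
features = tf.random.normal((16, 64, 2048))
# hidden: previous decoder GRU state (assumed hidden size of 512)
hidden = tf.random.normal((16, 512))

context_vector, attention_weights = attention(features, hidden)
# context_vector.shape    -> (16, 2048)
# attention_weights.shape -> (16, 64, 1)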

But when I went to the Neural Machine Translation task, I found this more complex version, and I am not able to comprehend what is happening here:

class BahdanauAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super().__init__()
    self.W1 = tf.keras.layers.Dense(units, use_bias=False)
    self.W2 = tf.keras.layers.Dense(units, use_bias=False)

    self.attention = tf.keras.layers.AdditiveAttention()

  def call(self, query, value, mask):
    # query: decoder RNN output, shape (batch, t, query_units)
    # value: encoder output, shape (batch, s, value_units)
    # mask: boolean padding mask for the source sequence, shape (batch, s)
    w1_query = self.W1(query)
    w2_key = self.W2(value)

    # Every query position is real; only the value (source) side can be padded.
    query_mask = tf.ones(tf.shape(query)[:-1], dtype=bool)
    value_mask = mask

    context_vector, attention_weights = self.attention(
        inputs=[w1_query, value, w2_key],
        mask=[query_mask, value_mask],
        return_attention_scores=True,
    )
    return context_vector, attention_weights
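
And here is the equivalent minimal usage sketch for this layer as I understand it (again, the batch size, sequence lengths, and `units` are assumptions on my part):

import tensorflow as tf

attention = BahdanauAttention(units=512)

# query: decoder RNN output for t target steps (assumed t=5, units=512, batch of 16)
query = tf.random.normal((16, 5, 512))
# value: encoder output for s source steps (assumed s=10)
value = tf.random.normal((16, 10, 512))
# mask: True where the source token is real, False where it is padding
mask = tf.ones((16, 10), dtype=bool)

context_vector, attention_weights = attention(query=query, value=value, mask=mask)
# context_vector.shape    -> (16, 5, 512)  one context vector per target step
# attention_weights.shape -> (16, 5, 10)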

I want to ask:

  1. What is the difference between the two?
  2. Why can't we use the code for caption generation in the second task, or vice versa?
    Does this answer your question? [Output shapes of Keras AdditiveAttention Layer](https://stackoverflow.com/questions/67353657/output-shapes-of-keras-additiveattention-layer) – Innat May 27 '21 at 15:40
  • So you mean to say we can use both things in both tasks interchangeably, if I can pass my in and out parameters of the same shape? – Deshwal May 27 '21 at 16:33
  • They look like two different implementations of the same thing, with subtle differences between them. Looking at the code, I don't think they would have any practical difference in performance. However, to get the correct exact implementation, you'll have to go through the paper that introduced Bahdanau Attention. – Susmit Agrawal May 27 '21 at 17:29
  • @SusmitAgrawal Are they producing `decoder_state_t` using the scores, the current word and `hidden_decoder_state_t-1`, and then again using that word, the scores and this `new_decoder_state_t` to produce the vectors? I think they are doing this in the second code. Can you comment? – Deshwal May 28 '21 at 04:42
