Understanding the tensor dimensions in the decoder for NMT and image captioning with attention

Question

I have been working through the attention models in the two tutorials below.

https://www.tensorflow.org/tutorials/text/nmt_with_attention

and

https://www.tensorflow.org/tutorials/text/image_captioning

In both tutorials, I do not understand how the decoder is defined.

The decoder in the NMT with attention tutorial is as follows:

class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)

    # used for attention
    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state, attention_weights

In the code above, at the line x = self.embedding(x), which is commented as "x shape after passing through embedding == (batch_size, 1, embedding_dim)", what should x be? Is it just the target input?
Also, I do not understand why the output shape has to be (batch_size * 1, hidden_size). Why batch_size * 1?

The decoder in the image captioning tutorial is as follows:

class RNN_Decoder(tf.keras.Model):
  def __init__(self, embedding_dim, units, vocab_size):
    super(RNN_Decoder, self).__init__()
    self.units = units

    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc1 = tf.keras.layers.Dense(self.units)
    self.fc2 = tf.keras.layers.Dense(vocab_size)

    self.attention = BahdanauAttention(self.units)

  def call(self, x, features, hidden):
    # defining attention as a separate model
    context_vector, attention_weights = self.attention(features, hidden)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # shape == (batch_size, max_length, hidden_size)
    x = self.fc1(output)

    # x shape == (batch_size * max_length, hidden_size)
    x = tf.reshape(x, (-1, x.shape[2]))

    # output shape == (batch_size * max_length, vocab)
    x = self.fc2(x)

    return x, state, attention_weights

  def reset_state(self, batch_size):
    return tf.zeros((batch_size, self.units))

Why does the output have to be reshaped to (batch_size * max_length, hidden_size)?

Could someone please explain this in detail? It would help me a lot.

score 0 · Accepted Answer · answered Apr 17 '20 at 07:19

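The reshape is there because of the fully connected layer that follows it. Flattening the GRU output to two dimensions makes that layer return logits of shape (batch_size, vocab), one row per predicted token, which is the form a per-step loss expects. Note that tf.keras.layers.Dense does not strictly require a two-dimensional input: it accepts higher-rank tensors and acts on the last axis (fc1 in the second example is applied to a three-dimensional tensor), so the reshape is about the shape of the result rather than a limitation of the layer.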
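To make the effect of the -1 in the reshape calls concrete, here is a minimal pure-Python sketch of the shape arithmetic. The helper resolve_reshape and the sizes are hypothetical and only mimic how a -1 dimension is filled in; it does not call TensorFlow.

```python
from math import prod
from typing import Sequence, Tuple


def resolve_reshape(shape: Sequence[int], target: Sequence[int]) -> Tuple[int, ...]:
    """Return the concrete shape that reshaping `shape` to `target` would give.

    At most one entry of `target` may be -1; it is filled in so that the
    total number of elements stays the same.
    """
    if list(target).count(-1) > 1:
        raise ValueError("only one dimension can be -1")
    total = prod(shape)
    known = prod(d for d in target if d != -1)
    if -1 in target:
        if known == 0 or total % known != 0:
            raise ValueError(f"cannot reshape {tuple(shape)} into {tuple(target)}")
        return tuple(total // known if d == -1 else d for d in target)
    if known != total:
        raise ValueError(f"cannot reshape {tuple(shape)} into {tuple(target)}")
    return tuple(target)


# Hypothetical sizes, chosen only for illustration.
batch_size, max_length, hidden_size = 4, 3, 8

# Step-by-step decoding: the length axis is 1, so batch_size * 1 == batch_size.
one_step = resolve_reshape((batch_size, 1, hidden_size), (-1, hidden_size))

# General case: the batch and length axes are merged into one.
full_seq = resolve_reshape((batch_size, max_length, hidden_size), (-1, hidden_size))

print(one_step)
print(full_seq)
```

With a length of 1 the reshape only drops the middle axis; with a longer sequence it merges the batch and time axes, so each row is one position of one example.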

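In the first example, the decoder's call method is meant to be executed inside a for loop, once per time step, both at training and at inference time. So x is the batch of target tokens for the current step, with shape (batch_size, 1): the ground-truth token during training (teacher forcing, as in the tutorial) and the previously predicted token at inference. The GRU needs input of shape batch × length × dim, and when you call it step by step, the length is 1. That is why the comment says batch_size * 1.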
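As a rough illustration of the step-by-step calling pattern, the sketch below builds the per-step decoder inputs from a toy batch of target sequences and traces the shapes through one call of the first decoder. The helper names (step_inputs, shape_trace), the token ids, and the sizes are all hypothetical; it also assumes column 0 of the targets holds the start token, and it is plain Python rather than TensorFlow.

```python
from typing import Dict, List, Tuple


def step_inputs(targets: List[List[int]]) -> List[List[List[int]]]:
    """Build one decoder input per time step, each of shape (batch_size, 1).

    Step t feeds targets[:, t] and is trained to predict targets[:, t + 1],
    so the last column is never used as an input.
    """
    length = len(targets[0])
    return [[[row[t]] for row in targets] for t in range(length - 1)]


def shape_trace(batch_size: int, embedding_dim: int, hidden_size: int,
                dec_units: int, vocab_size: int) -> Dict[str, Tuple[int, ...]]:
    """Shapes of the intermediate tensors in one call of the first decoder."""
    return {
        "x": (batch_size, 1),
        "embedded": (batch_size, 1, embedding_dim),
        "context": (batch_size, hidden_size),
        "concat": (batch_size, 1, embedding_dim + hidden_size),
        "gru_output": (batch_size, 1, dec_units),
        "state": (batch_size, dec_units),
        "reshaped": (batch_size * 1, dec_units),
        "logits": (batch_size, vocab_size),
    }


# Toy batch of two target sequences (made-up token ids; 1 = start, 2 = end, 0 = pad).
targets = [[1, 5, 6, 2],
           [1, 7, 2, 0]]

inputs = step_inputs(targets)
for t, x in enumerate(inputs):
    print(t, x)

trace = shape_trace(batch_size=2, embedding_dim=4, hidden_size=8,
                    dec_units=8, vocab_size=10)
for name, shape in trace.items():
    print(name, shape)
```

Each element of inputs is what x would hold for one call: a column of token ids, one per example in the batch.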

In the second example, the reshape is written for a general sequence length, hence max_length in the comments. It still works with length 1, so you can use the decoder in a for loop at inference time; in that case batch_size * max_length is simply batch_size.
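The inference-time for loop can be sketched as follows. This is a toy under stated assumptions: toy_decoder_step is a hypothetical stand-in for a single decoder call on one token, the token ids are made up, and picking the highest logit (greedy decoding) is chosen here only for simplicity.

```python
from typing import List, Tuple

VOCAB_SIZE = 5
START, END = 0, 4  # hypothetical token ids


def toy_decoder_step(token: int, state: int) -> Tuple[List[float], int]:
    """Stand-in for one decoder call with a length-1 input.

    Returns logits over the vocabulary and an updated state. This toy
    version always prefers the token id that follows the input token.
    """
    logits = [0.0] * VOCAB_SIZE
    logits[(token + 1) % VOCAB_SIZE] = 1.0
    return logits, state + 1


def greedy_decode(max_steps: int = 10) -> List[int]:
    """Feed the previous prediction back in, one token per call."""
    token, state = START, 0
    result = []
    for _ in range(max_steps):
        logits, state = toy_decoder_step(token, state)
        # Pick the index of the largest logit as the next token.
        token = max(range(VOCAB_SIZE), key=lambda i: logits[i])
        result.append(token)
        if token == END:
            break
    return result


print(greedy_decode())
```

The point is only the control flow: every call sees a single token, so the length axis inside the decoder is 1 on each iteration.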

answered Apr 17 '20 at 07:19

Jindřich


Thank you for your clarification! It helped a lot. – Jun May 04 '20 at 02:37
If you think the answer is correct, please mark it as correct, so that other people who come across the same problem know. – Jindřich May 04 '20 at 07:21
