
In this tutorial on the TensorFlow site, we can see code for the implementation of an autoencoder whose decoder is as follows:

import tensorflow as tf

# BahdanauAttention is defined earlier in the linked tutorial
class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)

    # used for attention
    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state, attention_weights

The BahdanauAttention is applied to the encoder output and the previous hidden state; the resulting context vector is then concatenated with the embedding lookup of the input and fed to the GRU.
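
To check my understanding, a single decoding step with this decoder can be called roughly like this (a minimal sketch with made-up sizes; it assumes the BahdanauAttention class defined earlier in the tutorial):

import tensorflow as tf

batch_sz, max_length, units, vocab_size, embedding_dim = 64, 16, 1024, 5000, 256

decoder = Decoder(vocab_size, embedding_dim, units, batch_sz)

enc_output = tf.random.uniform((batch_sz, max_length, units))  # encoder outputs
dec_hidden = tf.random.uniform((batch_sz, units))              # previous decoder hidden state
dec_input = tf.random.uniform((batch_sz, 1), maxval=vocab_size, dtype=tf.int32)  # one target token per example

# attention is computed from (dec_hidden, enc_output) *before* the GRU runs
predictions, dec_hidden, attention_weights = decoder(dec_input, dec_hidden, enc_output)
# predictions: (batch_sz, vocab_size), attention_weights: (batch_sz, max_length, 1)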

Yet in another implementation from this GitHub repository (written in PyTorch), the attention is applied to the output of the GRU:

import torch
import torch.nn as nn


class DecoderAttn(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, out_bias):
        super(DecoderAttn, self).__init__()
        self.hidden_size = hidden_size
        self.input_size = input_size
        
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.emb_drop = nn.Dropout(0.2)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.gru_drop = nn.Dropout(0.2)
        self.mlp = nn.Linear(hidden_size*2, output_size)
        if out_bias is not None:
            out_bias_tensor = torch.tensor(out_bias, requires_grad=False)
            self.mlp.bias.data[:] = out_bias_tensor
        self.logsoftmax = nn.LogSoftmax(dim=2)
        
        self.att_mlp = nn.Linear(hidden_size, hidden_size, bias=False)
        self.attn_softmax = nn.Softmax(dim=2)
    
    def forward(self, input, hidden, encoder_outs):
        emb = self.embedding(input)
        out, hidden = self.gru(self.emb_drop(emb), hidden)
        
        # attention: the GRU *output* is the query against the encoder outputs
        out_proj = self.att_mlp(out)                    # (B, 1, H)
        enc_out_perm = encoder_outs.permute(0, 2, 1)    # (B, H, T_src)
        e_exp = torch.bmm(out_proj, enc_out_perm)       # attention scores (B, 1, T_src)
        attn = self.attn_softmax(e_exp)
        
        ctx = torch.bmm(attn, encoder_outs)             # context vector (B, 1, H)
        
        # concatenate the GRU output with the context vector before the output layer
        full_ctx = torch.cat([self.gru_drop(out), ctx], dim=2)
        
        out = self.mlp(full_ctx)
        out = self.logsoftmax(out)
        return out, hidden, attn
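
For comparison, a single step with this decoder would look roughly like this (again a minimal sketch with made-up sizes, not taken from the repository):

import torch

batch, src_len, hidden_size, vocab_size = 64, 16, 256, 5000

decoder = DecoderAttn(vocab_size, hidden_size, vocab_size, out_bias=None)

encoder_outs = torch.randn(batch, src_len, hidden_size)   # (B, T_src, H) encoder outputs
hidden = torch.zeros(1, batch, hidden_size)               # initial GRU hidden state
input_tok = torch.randint(0, vocab_size, (batch, 1))      # one target token per example

# the GRU runs first; attention is computed from its output afterwards
out, hidden, attn = decoder(input_tok, hidden, encoder_outs)
# out: (B, 1, vocab_size) log-probabilities, attn: (B, 1, T_src)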

Is the second case a mistake? If it is not a mistake, what is the difference between it and the first decoder? How does changing where the attention is applied affect the output?

Marzi Heidari
  • Obviously, it depends on what you want to do. In these cases, self-attention is applied to the encoded representation. I don't see much difference between the two cases. If they are the same GRU autoencoder, shouldn't they be the same? Why do you think their attention mechanisms are different? – ghchoi Dec 05 '20 at 16:12
  • @GyuHyeonChoi They are from two different sources. Both are used for image captioning. The first one applies attention to the input of the GRU while the second one applies it to the output of the GRU. I want to know the reason and the difference. – Marzi Heidari Dec 05 '20 at 16:15
  • I am afraid there isn't an authoritative answer. Unless they are equivalent, only experiments on the specific dataset/model can tell you which is better. – hkchengrex Dec 17 '20 at 09:41
  • To echo the other commenters: attention is a very generic operation. I am not aware of any limitation on its application that prescribes where you must use it. Your question is roughly akin to asking "why do some people put a linear layer before a GRU and some put it after?" The only reasonable answer is generally: "because it worked for them and/or they didn't try using it a different way". – Multihunter Dec 23 '20 at 01:11
  • Voting to close as this question is off-topic. – Ivan Aug 12 '21 at 08:43

0 Answers