This question relates to the neural machine translation model shown here: Neural Machine Translation
Here:
Batch size = 64
Input length (the number of words in the example input sentence, i.e., the number of distinct time steps) = 16
Number of RNN units (the dimensionality of the hidden state vector at each time step) = 1024
This is interpreted as:
In each batch (64 sentences total), for each input word (16 per sentence), there is a 1024-dimensional vector at the corresponding time step. This 1024-dimensional vector represents the input word at its particular time step during encoding and is called the hidden state of that word.
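As I understand it, the two shapes come from the encoder RNN. Here is a minimal sketch of my understanding, assuming the tutorial's tf.keras.layers.GRU encoder; the embedding width of 256 is a hypothetical value just for illustration:

```python
import tensorflow as tf

batch_size, input_length, units = 64, 16, 1024
embedding_dim = 256  # hypothetical embedding width, for illustration only

# Stand-in for an embedded input batch: (batch_size, input_length, embedding_dim)
x = tf.random.normal((batch_size, input_length, embedding_dim))

gru = tf.keras.layers.GRU(units,
                          return_sequences=True,  # keep the hidden state of every time step
                          return_state=True)      # also return the final hidden state separately

output, state = gru(x)
print(output.shape)  # (64, 16, 1024): one 1024-d hidden state per word, per sentence
print(state.shape)   # (64, 1024):     only the hidden state after the last (16th) word
```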
My question is:
Why is the hidden state of shape (64, 1024) while the encoder output is of shape (64, 16, 1024)? Shouldn't both be the same? For each batch we have 16 words in the input sentence, and for each word in the input sentence we have a 1024-dimensional hidden state vector, so at the end of the encoding step we should get a cumulative hidden state tensor of shape (64, 16, 1024), which is exactly the encoder output. Both should have the same dimensions.
The encoder hidden state with shape (64, 1024) is then provided as the initial hidden state input to the decoder.
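For reference, here is a self-contained sketch (same GRU assumption as above, hypothetical embedding width of 256) suggesting that this (64, 1024) state is just the last time-step slice of the (64, 16, 1024) output:

```python
import tensorflow as tf

gru = tf.keras.layers.GRU(1024, return_sequences=True, return_state=True)
x = tf.random.normal((64, 16, 256))  # 256 is a hypothetical embedding width
output, state = gru(x)

# The (64, 1024) state passed to the decoder appears to be just the last
# time-step slice of the (64, 16, 1024) encoder output:
print(bool(tf.reduce_all(state == output[:, -1, :])))  # True
```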
Another related question:
If the input length is 16 words, what is the reason for using 1024 units in the encoder instead of 16? As far as I can tell from a quick sanity check (again assuming tf.keras.layers.GRU, see below), the unit count only changes the width of each hidden state vector, not the number of time steps, which makes me wonder about the motivation even more.
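```python
import tensorflow as tf

x = tf.random.normal((64, 16, 256))  # still 16 words per sentence

small_gru = tf.keras.layers.GRU(16, return_sequences=True, return_state=True)
output, state = small_gru(x)
print(output.shape)  # (64, 16, 16): each word is now summarized by only 16 numbers
print(state.shape)   # (64, 16)

big_gru = tf.keras.layers.GRU(1024, return_sequences=True, return_state=True)
output, state = big_gru(x)
print(output.shape)  # (64, 16, 1024): same 16 time steps, wider representation per word
print(state.shape)   # (64, 1024)
```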