This question relates to the neural machine translation model shown here: Neural Machine Translation
Here:
Batch size = 64
Input length (the number of words in the example input sentence, i.e., the number of distinct time steps) = 16
Number of RNN units (the dimensionality of the hidden state vector at each time step) = 1024
This is interpreted as:
In each batch (64 sentences total), for each input word (16 per sentence), there is a 1024-dimensional vector at the corresponding time step. This 1024-dimensional vector represents the input word at its particular time step during encoding and is called the hidden state of that word.
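As I understand it, the two shapes come from the encoder RNN. Here is a minimal sketch of my understanding, assuming the tutorial's tf.keras.layers.GRU encoder; the embedding width of 256 is a hypothetical value just for illustration:

```python
import tensorflow as tf

batch_size, input_length, units = 64, 16, 1024
embedding_dim = 256  # hypothetical embedding width, for illustration only

# Stand-in for an embedded input batch: (batch_size, input_length, embedding_dim)
x = tf.random.normal((batch_size, input_length, embedding_dim))

gru = tf.keras.layers.GRU(units,
                          return_sequences=True,  # keep the hidden state of every time step
                          return_state=True)      # also return the final hidden state separately

output, state = gru(x)
print(output.shape)  # (64, 16, 1024): one 1024-d hidden state per word, per sentence
print(state.shape)   # (64, 1024):     only the hidden state after the last (16th) word
```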
My question is:
Why is the hidden state of shape (64, 1024) while the encoder output is of shape (64, 16, 1024)? Shouldn't both be the same? For each batch we have 16 words in the input sentence, and for each word in the input sentence we have a 1024-dimensional hidden state vector, so at the end of the encoding step we should get a cumulative hidden state tensor of shape (64, 16, 1024), which is exactly the encoder output. Both should have the same dimensions.
The encoder hidden state with shape (64, 1024) is then provided as the initial hidden state input to the decoder.
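For reference, here is a self-contained sketch (same GRU assumption as above, hypothetical embedding width of 256) suggesting that this (64, 1024) state is just the last time-step slice of the (64, 16, 1024) output:

```python
import tensorflow as tf

gru = tf.keras.layers.GRU(1024, return_sequences=True, return_state=True)
x = tf.random.normal((64, 16, 256))  # 256 is a hypothetical embedding width
output, state = gru(x)

# The (64, 1024) state passed to the decoder appears to be just the last
# time-step slice of the (64, 16, 1024) encoder output:
print(bool(tf.reduce_all(state == output[:, -1, :])))  # True
```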
Another related question:
If the input length is 16 words, what is the reason for using 1024 units in the encoder instead of 16? As far as I can tell from a quick sanity check (again assuming tf.keras.layers.GRU, see below), the unit count only changes the width of each hidden state vector, not the number of time steps, which makes me wonder about the motivation even more.
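```python
import tensorflow as tf

x = tf.random.normal((64, 16, 256))  # still 16 words per sentence

small_gru = tf.keras.layers.GRU(16, return_sequences=True, return_state=True)
output, state = small_gru(x)
print(output.shape)  # (64, 16, 16): each word is now summarized by only 16 numbers
print(state.shape)   # (64, 16)

big_gru = tf.keras.layers.GRU(1024, return_sequences=True, return_state=True)
output, state = big_gru(x)
print(output.shape)  # (64, 16, 1024): same 16 time steps, wider representation per word
print(state.shape)   # (64, 1024)
```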