Not quite. Unfortunately, Figure 1 in the mentioned paper is a bit misleading. The six encoding layers are not in parallel, as the figure might suggest; rather, they are successive, meaning that the hidden state/output of the previous layer is used as the input to the subsequent layer.
This, together with the fact that the input (embedding) dimension is NOT the output dimension of the LSTM layer (the latter is in fact 2 * hidden_size), changes your output dimension to exactly that: 2 * hidden_size, before it is passed into the final projection layer, which in turn changes the dimension according to your specifications.
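As a quick dimension check, here is a minimal sketch (assuming PyTorch and a bidirectional LSTM; the sizes are made up for illustration, not taken from the paper):

import torch
import torch.nn as nn

hidden_size, embed_dim, proj_dim = 256, 128, 320   # illustrative values only

# A bidirectional LSTM concatenates the forward and backward states,
# so each timestep's output has dimension 2 * hidden_size, not embed_dim.
lstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_size,
               bidirectional=True, batch_first=True)
projection = nn.Linear(2 * hidden_size, proj_dim)

x = torch.randn(8, 20, embed_dim)    # (batch, seq_len, embed_dim)
out, _ = lstm(x)                     # (8, 20, 2 * hidden_size) = (8, 20, 512)
projected = projection(out)          # (8, 20, proj_dim), set by your specification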
It is not quite clear to me what the add described in the layer does, but if you look at a reference implementation, it seems to be irrelevant to the answer. Specifically, observe that the encoding function is basically:
def encode(...):
    encode_inputs = self.embed(...)
    for l in range(num_layers):
        prev_input = encode_inputs            # kept around, e.g. for a residual add
        encode_inputs = self.nth_layer(...)   # each layer consumes the previous layer's output
        # ...
Obviously, there is a bit more happening here, but this illustrates the basic functional block of the network.
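To make the stacking explicit, here is a self-contained sketch of that loop (my own approximation with made-up names, not the actual reference implementation); note how each layer consumes the output of the previous layer, not the raw embeddings:

import torch
import torch.nn as nn

class StackedEncoder(nn.Module):
    # Toy encoder: an embedding followed by successive LSTM layers.
    def __init__(self, vocab_size, embed_dim, hidden_size, num_layers):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Only the first layer is bidirectional here; the later layers simply keep
        # the 2 * hidden_size width. That is a choice for the sketch, not a claim
        # about the paper's exact setup.
        self.layers = nn.ModuleList(
            [nn.LSTM(embed_dim, hidden_size, bidirectional=True, batch_first=True)]
            + [nn.LSTM(2 * hidden_size, 2 * hidden_size, batch_first=True)
               for _ in range(num_layers - 1)]
        )

    def forward(self, tokens):
        encode_inputs = self.embed(tokens)           # (batch, seq, embed_dim)
        for layer in self.layers:
            prev_input = encode_inputs               # would feed a residual add, if used
            encode_inputs, _ = layer(encode_inputs)  # (batch, seq, 2 * hidden_size)
        return encode_inputs                         # this is what the projection layer sees

encoder = StackedEncoder(vocab_size=1000, embed_dim=128, hidden_size=256, num_layers=6)
tokens = torch.randint(0, 1000, (4, 15))             # (batch, seq_len)
print(encoder(tokens).shape)                         # torch.Size([4, 15, 512]) == 2 * hidden_size

The exact layer types and the residual add are details you would take from the reference implementation; the point is just the successive wiring and the 2 * hidden_size width before the projection.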