
I'm fairly new to NLP and was reading a blog explaining the transformer model. I'm quite confused about the input/output of the decoder block (attached below). I get that y_true is fed into the decoder during the training step to combine with the output of the encoder block. What I don't get is: if we already know y_true, why run this step to get the output probabilities at all? I also don't quite get the relationship between the bottom-right "Output Embedding" and the top-right "Output Probabilities". When we actually use the model we won't have y_true, so do we just feed the previous predictions (y_pred) into the decoder instead? This might be a noob question. Thanks in advance.

The Decoder Block of the Transformer Architecture
Taken from “Attention Is All You Need“

Paul726

2 Answers


I get that y_true is fed into the decoder during the training step to combine with the output of the encoder block.

Well, yes and no.

The job of the decoder block is to predict the next word. The inputs to the decoder are the output of the encoder and the previous outputs of the decoder block itself.

Let's take a translation example: English to Spanish.

  • We have 5 dogs -> Nosotras tenemos 5 perros

The encoder will encode the English sentence and produce an attention vector as output. At the first step the decoder is fed the attention vector and a <START> token. The decoder will (should) produce the first Spanish word, Nosotras. This is Yt. In the next step the decoder is again fed the attention vector, as well as the <START> token and the previous output Yt-1 (Nosotras). The output will be tenemos, and so on and so forth, until the decoder spits out an <END> token.

The decoder is thus an autoregressive model: it relies on its own previous outputs to generate the next token in the sequence.
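
In rough pseudocode, the inference loop looks something like this (a minimal sketch; encode, decode_step, start_id, end_id and greedy argmax decoding are all illustrative assumptions, not any particular library's API):

```python
# Minimal sketch of greedy autoregressive decoding at inference time.
# `encode` and `decode_step` stand in for the trained encoder and decoder;
# `start_id` / `end_id` are the ids of the <START> / <END> tokens.
def translate(source_tokens, encode, decode_step, start_id, end_id, max_len=50):
    # Encode the source sentence once; the result is reused at every step.
    encoder_output = encode(source_tokens)

    # The decoder starts from only the <START> token.
    generated = [start_id]
    for _ in range(max_len):
        # The decoder sees the encoder output plus everything it has produced
        # so far, and returns a distribution over the next token.
        next_token_probs = decode_step(encoder_output, generated)
        next_token = max(range(len(next_token_probs)),
                         key=lambda i: next_token_probs[i])  # greedy argmax
        generated.append(next_token)
        if next_token == end_id:  # stop once <END> is produced
            break
    return generated
```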

Bhupen

In addition to @Bhupen's answer, it is worth highlighting the differences from seq-to-seq models based on RNNs, for which this sequential processing is always necessary.

Transformers have the fundamental advantage that you can train them with parallel processing: during training, the decoder can process the entire (shifted) target sequence in a single parallel forward pass, just like the encoder processes the source sequence. This allows for substantial speedups and larger training sets, to which transformer models owe much of their success.
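
For illustration, here is a minimal sketch of how the decoder input and target could be prepared for such a parallel training step (teacher forcing with a look-ahead mask); the token ids and the commented-out decoder/loss calls are assumptions for the example, not code from any specific implementation:

```python
import numpy as np

# Hypothetical token ids for the target sentence
# <START> Nosotras tenemos 5 perros <END>
START, END = 1, 2
y_true = [START, 11, 12, 13, 14, END]

decoder_input  = y_true[:-1]   # shifted right: <START> Nosotras tenemos 5 perros
decoder_target = y_true[1:]    # what to predict at each position: Nosotras ... <END>

# Causal (look-ahead) mask: position i may only attend to positions <= i,
# so the prediction for position i never sees the ground-truth token it is
# supposed to predict.
n = len(decoder_input)
causal_mask = np.tril(np.ones((n, n)))

# A single parallel forward pass then yields output probabilities for every
# position at once, and the loss compares them with decoder_target, e.g.:
#   logits = decoder(encoder_output, decoder_input, mask=causal_mask)
#   loss   = cross_entropy(logits, decoder_target)
```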

So to answer the original question, the input to the decoder is:

  • during training: all the correct tokens (y_true), shifted right and masked, processed at once in parallel (as sketched above);
  • during inference: the decoder's own previous outputs, starting with the special token <start>, until another special token <end> is predicted.

For reference, you can also look at the excellent TensorFlow implementation of the transformer.

Korbinian