
I'm fairly new to NLP and was reading a blog explaining the transformer model. I'm quite confused about the input/output of the decoder block (attached below). I get that y_true is fed into the decoder during the training step to combine with the output of the encoder block. What I don't get is: if we already know y_true, why run this step to get the output probabilities at all? I also don't quite get the relationship between the bottom-right "Output Embedding" and the top-right "Output Probabilities". When we actually use the model we won't have y_true, so do we just feed the previous predictions (y_pred) into the decoder instead? This might be a noob question. Thanks in advance.

The Decoder Block of the Transformer Architecture
Taken from “Attention Is All You Need“

Paul726

2 Answers


I get that y_true is fed into the decoder during the training step to combine with the output of the encoder block.

Well, yes and no.

The job of the decoder block is to predict the next word. The inputs to the decoder are the output of the encoder and the previous outputs of the decoder block itself.

Let's take a translation example: English to Spanish.

  • We have 5 dogs -> Nosotras tenemos 5 perros

The encoder will encode the English sentence and produce an attention vector as output. At the first step the decoder is fed the attention vector and a <START> token. The decoder will (should) produce the first Spanish word, Nosotras. This is Yt. In the next step the decoder is again fed the attention vector, as well as the <START> token and the previous output Yt-1 (Nosotras). The output will be tenemos, and so on and so forth, until the decoder spits out an <END> token.

The decoder is thus an autoregressive model: it relies on its own previous outputs to generate the next token in the sequence.
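
In rough pseudocode, the inference loop looks something like this (a minimal sketch; encode, decode_step, start_id, end_id and greedy argmax decoding are all illustrative assumptions, not any particular library's API):

```python
# Minimal sketch of greedy autoregressive decoding at inference time.
# `encode` and `decode_step` stand in for the trained encoder and decoder;
# `start_id` / `end_id` are the ids of the <START> / <END> tokens.
def translate(source_tokens, encode, decode_step, start_id, end_id, max_len=50):
    # Encode the source sentence once; the result is reused at every step.
    encoder_output = encode(source_tokens)

    # The decoder starts from only the <START> token.
    generated = [start_id]
    for _ in range(max_len):
        # The decoder sees the encoder output plus everything it has produced
        # so far, and returns a distribution over the next token.
        next_token_probs = decode_step(encoder_output, generated)
        next_token = max(range(len(next_token_probs)),
                         key=lambda i: next_token_probs[i])  # greedy argmax
        generated.append(next_token)
        if next_token == end_id:  # stop once <END> is produced
            break
    return generated
```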

Bhupen

In addition to @Bhupen's answer, it is worth highlighting the differences from seq-to-seq models based on RNNs, for which this sequential processing is always necessary.

Transformers have the fundamental advantage that you can train them with parallel processing: during training, the decoder can process the entire (shifted) target sequence in a single parallel forward pass, just like the encoder processes the source sequence. This allows for substantial speedups and larger training sets, to which transformer models owe much of their success.
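
For illustration, here is a minimal sketch of how the decoder input and target could be prepared for such a parallel training step (teacher forcing with a look-ahead mask); the token ids and the commented-out decoder/loss calls are assumptions for the example, not code from any specific implementation:

```python
import numpy as np

# Hypothetical token ids for the target sentence
# <START> Nosotras tenemos 5 perros <END>
START, END = 1, 2
y_true = [START, 11, 12, 13, 14, END]

decoder_input  = y_true[:-1]   # shifted right: <START> Nosotras tenemos 5 perros
decoder_target = y_true[1:]    # what to predict at each position: Nosotras ... <END>

# Causal (look-ahead) mask: position i may only attend to positions <= i,
# so the prediction for position i never sees the ground-truth token it is
# supposed to predict.
n = len(decoder_input)
causal_mask = np.tril(np.ones((n, n)))

# A single parallel forward pass then yields output probabilities for every
# position at once, and the loss compares them with decoder_target, e.g.:
#   logits = decoder(encoder_output, decoder_input, mask=causal_mask)
#   loss   = cross_entropy(logits, decoder_target)
```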

So to answer the original question, the input to the decoder is:

  • during training: all the correct tokens (y_true), shifted right and masked, processed at once in parallel (as sketched above);
  • during inference: the decoder's own previous outputs, starting with the special token <start>, until another special token <end> is predicted.

For reference, you can also look at the excellent TensorFlow implementation of the transformer.

Korbinian