I've been reading a lot about transformers and self-attention, and have seen that BERT and GPT-2 are newer variants that use only an encoder stack (BERT) or only a decoder stack (GPT-2). I've been trying to build a decoder-only model for next-sequence prediction, but I'm confused by one thing. I'm using PyTorch and have looked at the Seq2Seq tutorial and then into the TransformerDecoder block, which is made up of TransformerDecoder layers. My confusion comes from the memory these layers also need to be passed. The documentation says memory is the output of the last layer of the encoder block, which makes sense for a Seq2Seq model, but I want to make a decoder-only model. So my question is: what do you pass to a decoder-only model like GPT-2 for memory if you do not have an encoder?
1 Answer
After further investigation I believe I can now answer this myself. A decoder-only transformer doesn't actually use any memory, because there is no encoder-decoder attention (cross-attention) in it like there is in an encoder-decoder transformer. A decoder-only transformer looks a lot like an encoder transformer; it simply uses a masked self-attention layer instead of an unmasked one. To achieve this you can pass a square subsequent mask (upper triangle) so that the model cannot attend to future positions, which gives you a decoder-only model like those found in GPT-2/GPT-3.
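For reference, here is a minimal sketch of that idea in PyTorch: it builds the decoder-only stack from nn.TransformerEncoder plus a square subsequent mask, so no memory argument is ever needed. The class name and hyperparameters are illustrative (not from the original post), and positional encoding is omitted for brevity:

```python
import torch
import torch.nn as nn


class DecoderOnlyModel(nn.Module):
    """Sketch of a GPT-style decoder-only model: an encoder stack
    made causal with an upper-triangular attention mask."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.to_vocab = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (seq_len, batch) in PyTorch's default (non-batch-first) layout.
        seq_len = tokens.size(0)
        # Square subsequent mask: -inf above the diagonal means position i
        # can only attend to positions <= i, which is what makes this a
        # "decoder" rather than a bidirectional encoder.
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=tokens.device),
            diagonal=1,
        )
        x = self.embed(tokens)        # positional encoding omitted for brevity
        x = self.blocks(x, mask=mask)
        return self.to_vocab(x)       # per-position logits for the next token
```

Note that nn.Transformer.generate_square_subsequent_mask (a static method in recent PyTorch versions) produces an equivalent mask; building it with torch.triu here just makes the shape and semantics explicit.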

bellerb
- When predicting on test (new) data, what targets should you provide to the decoder? Zeros? Zeros with positional encoding? – John Aug 31 '21 at 18:45
- Do you mean output for evaluation of the model? Typically this is in the same format as your training data. In my case I was taking a tokenized approach to generating MIDI files, so I used token embeddings and then positionally encoded them for my input; for my output I used the input embedded tokens plus the next embedded token. This trained the model for next-token prediction. When using the model in the wild you simply give it the input (embedded tokens) and get back a probability distribution over the tokens as the output (see the sketch after these comments). – bellerb Sep 01 '21 at 20:04
- Please have a look at the following AI SE post for an in-depth overview of the decoder-only transformer model used in large language models such as ChatGPT and GPT-4: https://ai.stackexchange.com/questions/40179/how-does-a-transformer-or-large-language-model-work/40180#40180 – Robin van Hoorn Apr 24 '23 at 10:49
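To illustrate the input/output shifting described in the comment above, here is a hedged sketch of a single next-token training step. It reuses the hypothetical DecoderOnlyModel class from the answer, and the vocabulary size, sequence length, and random batch are all placeholders:

```python
import torch
import torch.nn.functional as F

model = DecoderOnlyModel(vocab_size=1000)  # hypothetical class from the answer above

# A fake batch of token ids, shape (seq_len + 1, batch).
tokens = torch.randint(0, 1000, (65, 8))

inputs = tokens[:-1]   # tokens 0 .. n-1 are fed to the model
targets = tokens[1:]   # tokens 1 .. n: each position's "next token"

logits = model(inputs)  # (seq_len, batch, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, 1000), targets.reshape(-1))
loss.backward()         # standard next-token (language-modelling) loss
```

At inference time there is no shifted target: you feed the tokens generated so far and sample the next token from the final position's probability distribution, exactly as described in the comment.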