
For example, let's say I open up the playground and type "Quack". What does the model do with those 5 characters to figure out what letters or words should come next?

(As it happens, GPT-3 filled in that prompt with "Quackery", then a tirade against cell therapy. Weird.)

Rubén

2 Answers


Encoding and pre-processing the input is a long process before it is ready to feed to the model, and then the reverse process decodes the output to give you the answer. This is what I gathered from reading an excellent summary by dugas.ch: The GPT-3 Architecture on a Napkin (https://dugas.ch/artificial_curiosity/GPT_architecture.html). A toy sketch of these steps follows the list.

  1. A byte-pair encoding (BPE) tokenizer splits the input into tokens (your input "Quack" is one word, and it probably remains a single token after tokenization).
  2. The input is padded to 2048 tokens. (Note: the original dugas.ch article pads to 2048 words first, but since a word can be broken into more than one token, it seems more logical to tokenize before padding.)
  3. GPT-3 has a vocabulary of 50257 tokens, so one-hot encoding turns the 2048 tokens into a matrix of size 2048 (rows) x 50257 (columns).
  4. An embedding reduces the dimensionality, conceptually similar to how LSA (Latent Semantic Analysis) uses SVD (singular value decomposition). The embedding dimension is 12288, so the 2048x50257 matrix is multiplied by a 50257x12288 embedding matrix, resulting in a 2048x12288 matrix.
  5. Embeddings capture the semantics of the input but are position independent. To account for positional significance, a positional encoding is added. The position matrix is also 2048x12288 and is added to the matrix from step 4, giving a 2048x12288 matrix.
  6. Then a series of steps performs self-attention and multi-head attention (96 layers). The output is a 2048x12288 matrix.
  7. The Napkin article then covers normalization and decoding. For decoding, you reverse steps 4, 3, and 2 to get a word output of size 2048x50257.
  8. Each row of the 2048x50257 output is a vector of scores over the vocabulary. Softmax turns each row into probabilities, and a token is then picked; commonly one simply chooses the token with the highest probability.
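
To make the matrix shapes concrete, here is a minimal numpy sketch of steps 2 through 5 and 8, using toy sizes (8, 100, 16) in place of GPT-3's real ones (2048, 50257, 12288), since a literal 2048x50257 one-hot matrix would take hundreds of megabytes. The token ids and weight matrices are made-up placeholders, not actual GPT-3 values, and steps 6-7 are skipped.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes standing in for GPT-3's real ones:
# context 2048 -> 8, vocab 50257 -> 100, embedding 12288 -> 16.
CTX, VOCAB, D = 8, 100, 16

token_ids = np.array([42, 7, 99, 0, 0, 0, 0, 0])  # step 2: padded token ids

# Step 3: one-hot encode -> (CTX, VOCAB)
one_hot = np.eye(VOCAB)[token_ids]

# Step 4: multiply by the (learned) embedding matrix -> (CTX, D).
# Note that one_hot @ W_e is just a row lookup into W_e.
W_e = rng.standard_normal((VOCAB, D))
embedded = one_hot @ W_e
assert np.allclose(embedded, W_e[token_ids])

# Step 5: add a positional encoding of the same shape (CTX, D)
pos = rng.standard_normal((CTX, D))
x = embedded + pos

# Steps 6-7 (the 96 attention blocks, then un-embedding) omitted;
# pretend `logits` is the resulting (CTX, VOCAB) score matrix.
logits = rng.standard_normal((CTX, VOCAB))

# Step 8: softmax each row into probabilities, pick the most likely token
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
next_token = probs[-1].argmax()  # greedy choice for the last position
print(next_token)
```

In the real model the one-hot multiplication is never materialized; the equivalent row lookup shown in the assert is what implementations actually do.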

This is all I could figure out. The article does not say how the trained model is used for inference. If someone knows the answer, I would like to know.

MonaPy

It is hard to give a good summary of all that happens in GPT-3, but I will try.

First the model encodes the word "Quack" into tokens, and each token gets an embedding representation. The tokens are then passed through the decoder components of the model, going through several neural network layers. Once the first decoder transformer block processes the tokens, it sends its resulting vectors up the stack to be processed by the next block. The processing is identical in each block, but each block has its own weights in both the self-attention and the neural network sublayers. In the end you get an array of output token probabilities, and you use the combined (or parts of the) array to select what the model considers the most likely combination of tokens for the output. These tokens are decoded back into normal text, and you get your rant against cell therapy back. A rough sketch of this flow follows.
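
Here is a minimal numpy sketch of that flow. The `block` function is a hypothetical stand-in for a real decoder transformer block (which applies masked self-attention plus a feed-forward sublayer); only the shape of the data flow, with each block owning its own weights and passing its output up the stack, matches the description above. All sizes and weights are toy placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
SEQ, D, VOCAB, N_BLOCKS = 4, 16, 100, 3  # toy sizes, not GPT-3's real ones

def block(x, w):
    # Hypothetical stand-in for one decoder block: a real block applies
    # masked self-attention and a feed-forward network, each sublayer
    # with its own learned weights.
    return np.tanh(x @ w)

x = rng.standard_normal((SEQ, D))  # embedded tokens for the prompt
block_weights = [rng.standard_normal((D, D)) for _ in range(N_BLOCKS)]

# Identical processing in every block, but each has its own weights;
# each block's resulting vectors are sent up the stack to the next block.
for w in block_weights:
    x = block(x, w)

W_out = rng.standard_normal((D, VOCAB))  # project back to the vocabulary
logits = x @ W_out                       # (SEQ, VOCAB) token scores
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(probs[-1].argmax())                # most likely next token id
```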

The result varies depending on the engine, temperature, and logit biases that are fed in the request.
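
To illustrate those two knobs, here is a small numpy sketch using the usual definitions (an assumption about the general mechanism, not OpenAI's exact implementation): a logit bias adds a constant to specific tokens' scores before softmax, and temperature divides the logits, so low values sharpen the distribution toward the top token and high values flatten it. The token ids and bias values are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
logits = rng.standard_normal(100)    # toy next-token scores

# Logit bias: add a constant to chosen token ids before softmax.
logit_bias = {13: +5.0, 50: -100.0}  # favor token 13, effectively ban 50
for token_id, bias in logit_bias.items():
    logits[token_id] += bias

def sample(logits, temperature, rng):
    # Temperature rescales the logits before softmax: values < 1 sharpen
    # the distribution toward the top token, values > 1 flatten it.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for stability
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

print(sample(logits, temperature=0.2, rng=rng))  # near-greedy
print(sample(logits, temperature=1.5, rng=rng))  # noticeably more random
```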

I recommend reading the following two links to get more insight into what happens internally, both written by the brilliant Jay Alammar.

https://jalammar.github.io/how-gpt3-works-visualizations-animations/

https://jalammar.github.io/illustrated-gpt2/

edutuario