
In the tf.keras Transformer tutorial (https://colab.research.google.com/github/tensorflow/text/blob/master/docs/tutorials/transformer.ipynb), the encoder layer is defined as:

class EncoderLayer(tf.keras.layers.Layer):
  def __init__(self,*,
               d_model, # Input/output dimensionality.
               num_attention_heads,
               dff, # Inner-layer dimensionality.
               dropout_rate=0.1
               ):
    super().__init__()


    # Multi-head self-attention.
    self.mha = tf.keras.layers.MultiHeadAttention(
        num_heads=num_attention_heads,
        key_dim=d_model, # Size of each attention head for query Q and key K.
        dropout=dropout_rate,
        )

According to the code comment and the MultiHeadAttention documentation, key_dim is the size of each attention head, so the relation between d_model and key_dim should be d_model / key_dim = num_heads. This convention is also used in the original paper https://arxiv.org/pdf/1706.03762.pdf (top of page 5):

"In this work we employ h = 8 parallel attention layers, or heads. For each of these we use dk = dv = dmodel/h = 64."


So is the configuration "wrong" in this tutorial, and should it have been:

 key_dim=d_model//num_attention_heads
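
For context, here is a minimal sketch (not from the tutorial; d_model = 512, num_heads = 8 and the dummy sequence length are assumed example values) that builds the layer and prints its weight shapes. It shows that key_dim is a per-head size, so key_dim=d_model gives every head the full model width:

import tensorflow as tf

d_model, num_heads = 512, 8              # assumed example values
x = tf.random.normal((1, 10, d_model))   # (batch, seq_len, d_model) dummy input

mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
_ = mha(x, x)                            # call once so the weights are built

for w in mha.weights:
    print(w.name, w.shape)
# The query/key/value kernels come out as (d_model, num_heads, key_dim),
# so key_dim=d_model gives each head 512 dimensions rather than 512/8 = 64.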
kawingkelvin
  • The way they configured it in the tutorial will obviously lead to a lot more trainable weights. I will actually try to run the training portion to see what's different. – kawingkelvin Sep 29 '22 at 23:18
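
As a rough check of the parameter difference mentioned in the comment above, here is a minimal sketch (again with assumed example values d_model = 512, num_heads = 8) comparing the two configurations with count_params():

import tensorflow as tf

d_model, num_heads = 512, 8              # assumed example values
x = tf.random.normal((1, 10, d_model))

big = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
small = tf.keras.layers.MultiHeadAttention(num_heads=num_heads,
                                           key_dim=d_model // num_heads)
_ = big(x, x)
_ = small(x, x)                          # build both layers

print(big.count_params())    # roughly 8.4M weights with key_dim=d_model
print(small.count_params())  # roughly 1.05M weights with key_dim=d_model//num_heads

Either way the output keeps its last dimension equal to d_model, because the output projection maps back to the query's feature size; only the per-head width and the weight count differ.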

0 Answers