In the tf.keras Transformer tutorial (https://colab.research.google.com/github/tensorflow/text/blob/master/docs/tutorials/transformer.ipynb), the encoder layer is defined as:
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, *,
                 d_model,              # Input/output dimensionality.
                 num_attention_heads,
                 dff,                  # Inner-layer dimensionality.
                 dropout_rate=0.1
                 ):
        super().__init__()

        # Multi-head self-attention.
        self.mha = tf.keras.layers.MultiHeadAttention(
            num_heads=num_attention_heads,
            key_dim=d_model,  # Size of each attention head for query Q and key K.
            dropout=dropout_rate,
        )
According to the code comment and the layer's documentation, key_dim is the size of each attention head, so I would expect key_dim = d_model / num_heads, i.e. d_model / key_dim = num_heads. This convention is also used in the original paper (https://arxiv.org/pdf/1706.03762.pdf, top of page 5):
"In this work we employ h = 8 parallel attention layers, or heads. For each of these we use dk = dv = dmodel/h = 64."
So is the configuration in this tutorial "wrong", and should it instead have been:

key_dim=d_model // num_attention_heads
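To make the difference concrete, here is a small check I put together (the variable names and the dummy 10-token input are my own placeholders, assuming TF 2.x). With key_dim=d_model the layer still returns an output of width d_model, because MultiHeadAttention projects the concatenated heads back to the query's last dimension; the difference shows up in the size of the per-head Q/K/V projections:

import tensorflow as tf

d_model = 512
num_heads = 8
x = tf.random.normal((1, 10, d_model))  # dummy batch: one sequence of 10 tokens

# Configuration as in the tutorial: key_dim = d_model.
mha_tutorial = tf.keras.layers.MultiHeadAttention(num_heads=num_heads,
                                                  key_dim=d_model)

# Configuration as in the paper: key_dim = d_model // num_heads.
mha_paper = tf.keras.layers.MultiHeadAttention(num_heads=num_heads,
                                               key_dim=d_model // num_heads)

print(mha_tutorial(x, x).shape)     # (1, 10, 512) -- output width is d_model either way
print(mha_paper(x, x).shape)        # (1, 10, 512)
print(mha_tutorial.count_params())  # ~8.4M parameters in the Q/K/V/output projections
print(mha_paper.count_params())     # ~1.05M parameters, the usual d_model/h convention

So both settings produce the same output shape, but with key_dim=d_model each head is as wide as the whole model, which (if I read it correctly) multiplies the attention parameter count by roughly num_heads compared to the paper's convention. That is what prompted my question.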