I am struggling to mask my input for the MultiHeadAttention layer. I am using the Transformer block from the Keras documentation with self-attention. I could not find any example code online so far and would appreciate it if someone could give me a code snippet.
The transformer block from this page:
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)
The documentation for masking can be found under this link:
attention_mask: a boolean mask of shape [B, T, S], that prevents attention to certain positions. The boolean mask specifies which query elements can attend to which key elements, 1 indicates attention and 0 indicates no attention. Broadcasting can happen for the missing batch dimensions and the head dimension.
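As far as I understand, that means a padding mask has to be broadcast from a per-token indicator up to that (B, T, S) shape. This is how I picture it (my own sketch with made-up values, not from the docs):

import tensorflow as tf

# Toy example: 0 marks a padded position, anything else a real token.
padded = tf.constant([[1, 1, 1, 0, 0],
                      [1, 1, 0, 0, 0]])
token_mask = tf.math.not_equal(padded, 0)                                      # (B, S) boolean
# Outer product of the token mask with itself: entry (b, t, s) is True only
# if both query position t and key position s are real tokens.
attention_mask = token_mask[:, tf.newaxis, :] & token_mask[:, :, tf.newaxis]   # (B, T, S)
print(attention_mask.shape)  # (2, 5, 5)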
The only thing I could get running is a mask created outside of the layer class as a NumPy array:
# Start with all ones (attend everywhere), shape (observations, T, S)
mask = np.ones((observations, sequence_length, sequence_length))
# Zero out the query rows that correspond to zero-padded positions in X
mask[X[:observations, :, 0] == 0] = 0
I then pass it in when calling the layer, the only change in the transformer block being:
    def call(self, inputs, mask, training):
        attn_output = self.att(inputs, inputs, attention_mask=mask)
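For completeness, this is how I call the modified block in a quick eager check, building on the snippets above (the sizes are just placeholders I picked):

# Quick eager check on a handful of observations.
seq_len = mask.shape[1]                                          # same as sequence_length
block = TransformerBlock(embed_dim=32, num_heads=2, ff_dim=64)   # placeholder sizes
x = tf.random.normal((5, seq_len, 32))                           # stands in for the embedded/CNN output
m = tf.constant(mask[:5].astype(bool))                           # (5, seq_len, seq_len) boolean mask
out = block(x, m, training=False)                                # output has the same shape as x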
However, this of course does not work when a batch_size is given while fitting, and with my memory it only works for about 5 observations, so it doesn't really make sense. Apart from that, I don't think this masks the input properly. In general I am quite confused about how to build the mask, given that attention_mask has shape (observations, sequence_length, sequence_length) while my input has shape (observations, sequence_length, features). The input is zero-padded, but by the time it reaches the transformer block it has already been through an embedding layer and a CNN. I have tried various ways to write a function that creates the mask during training using different Tensor or Keras objects (a sketch of one such attempt is below), but I run into errors every time.
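To make it concrete, the kind of thing I have been attempting looks roughly like this (a sketch with names I made up; raw_inputs stands for the zero-padded tensor before the embedding layer and CNN, and I am not at all sure the broadcasting is right):

class MaskedTransformerBlock(TransformerBlock):
    # Variant of the block above that builds the attention mask per batch
    # from the raw, zero-padded input instead of taking a precomputed mask.
    def call(self, inputs, raw_inputs, training=None):
        # raw_inputs: (batch, seq_len, features), zero-padded along the time axis
        token_mask = tf.math.not_equal(raw_inputs[:, :, 0], 0)               # (batch, seq_len) boolean
        # Broadcast to (batch, T, S): attend only where both query and key are real tokens
        attention_mask = token_mask[:, tf.newaxis, :] & token_mask[:, :, tf.newaxis]
        attn_output = self.att(inputs, inputs, attention_mask=attention_mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

My idea was to call it as masked_block(embedded, raw_batch, training=True) inside the model, so the mask is recomputed for every batch, but as I said I keep hitting errors.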
I hope someone more fluent in TensorFlow/Keras will be able to provide an example, or tell me that masking is pointless given my architecture. The model is performing well; however, I hoped masking could help speed up the computation. And it just bugs me that I cannot get my head around it.