
I am new to attention mechanisms and I want to learn more about them by working through some practical examples. I came across a Keras implementation of multi-head attention on this website: Pypi keras multi-head. I found two different ways to implement it in Keras.

  1. One way is to use multi-head attention as a Keras wrapper layer around either an LSTM or a CNN. This is a snippet implementing multi-head as a wrapper layer around an LSTM in Keras. This example is taken from the website keras multi-head:
import keras
from keras_multi_head import MultiHead

model = keras.models.Sequential()
model.add(keras.layers.Embedding(input_dim=100, output_dim=20, name='Embedding'))
model.add(MultiHead(keras.layers.LSTM(units=64), layer_num=3, name='Multi-LSTMs'))
model.add(keras.layers.Flatten(name='Flatten'))
model.add(keras.layers.Dense(units=4, activation='softmax', name='Dense'))
model.build()
model.summary()
  2. The other way is to use it separately as a stand-alone layer. This is a snippet of the second implementation, multi-head as a stand-alone layer, also taken from keras multi-head:
import keras
from keras_multi_head import MultiHeadAttention

input_layer = keras.layers.Input(shape=(2, 3), name='Input')
att_layer = MultiHeadAttention(head_num=3, name='Multi-Head')(input_layer)
model = keras.models.Model(inputs=input_layer, outputs=att_layer)
model.compile(optimizer='adam', loss='mse', metrics={})

I have been trying to find documentation that explains this, but I have not found any yet.

Update:

What I have found is that the second implementation (MultiHeadAttention) is closer to the Transformer paper "Attention Is All You Need". However, I am still struggling to understand the first implementation, the wrapper layer.
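
For reference, this is my rough understanding of a single scaled dot-product attention head from that paper, written out in plain NumPy (no trainable projections, so it is only a sketch of the idea, not code from the keras-multi-head package):

import numpy as np

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, depth); my own illustration of the paper's formula
    d_k = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)           # (batch, seq_q, seq_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ v                                         # (batch, seq_q, depth_v)

q = k = v = np.random.rand(1, 2, 3)   # self-attention on a (2, 3) input, as in the snippet above
print(scaled_dot_product_attention(q, k, v).shape)             # (1, 2, 3)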

Does the first one (the wrapper layer) combine the output of multi-head attention with the LSTM?

I was wondering if someone could explain the idea behind them, especially the wrapper layer.


1 Answer


I understand your confusion. From my experience, what MultiHead (the wrapper) does is duplicate (or parallelize) the wrapped layer to form a kind of multichannel architecture, where each channel can be used to extract different features from the input.

For instance, each channel can have a different configuration, and the channels' outputs are later concatenated to make an inference. So MultiHead can be used to wrap conventional architectures to form a multi-head CNN, multi-head LSTM, etc. A hand-written sketch of this multichannel idea is shown below.
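
To make that concrete, here is a rough sketch of the same multichannel idea written by hand with the functional API. This is my own reconstruction of the concept, not the library's exact internals, and the concatenation axis may differ from what the wrapper actually produces:

import keras

inputs = keras.layers.Input(shape=(None,), name='Input')
embedded = keras.layers.Embedding(input_dim=100, output_dim=20, name='Embedding')(inputs)

# Three independent copies of the wrapped layer, each with its own weights,
# playing the role of three "heads" / channels over the same embedding.
branches = [keras.layers.LSTM(units=64)(embedded) for _ in range(3)]

merged = keras.layers.Concatenate(name='Concat')(branches)  # combine the channels
outputs = keras.layers.Dense(units=4, activation='softmax', name='Dense')(merged)

model = keras.models.Model(inputs=inputs, outputs=outputs)
model.summary()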

Note that the attention layer is different. You may stack attention layers to form a new architecture, or you may parallelize the attention layer (MultiHeadAttention) and configure each one as explained above; see the sketch after this paragraph. See here for different implementations of the attention layer.
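
For example, assuming the same keras_multi_head package as in the question, stacking the stand-alone attention layer could look roughly like this (a minimal sketch; the input shape and head counts are arbitrary):

import keras
from keras_multi_head import MultiHeadAttention

input_layer = keras.layers.Input(shape=(2, 3), name='Input')
att_1 = MultiHeadAttention(head_num=3, name='Attention-1')(input_layer)  # first self-attention block
att_2 = MultiHeadAttention(head_num=3, name='Attention-2')(att_1)        # stacked on top of the first
model = keras.models.Model(inputs=input_layer, outputs=att_2)
model.compile(optimizer='adam', loss='mse')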
