I am new to attention mechanisms and I want to learn more about them by working through some practical examples. I came across a Keras implementation of multi-head attention on this website: PyPI keras-multi-head. It offers two different ways to use it in Keras.
- One way is to use multi-head attention as a Keras wrapper layer around another layer such as an LSTM or CNN. This is a snippet that applies the multi-head wrapper to an LSTM in Keras, taken from the keras-multi-head website:
import keras
from keras_multi_head import MultiHead
model = keras.models.Sequential()
model.add(keras.layers.Embedding(input_dim=100, output_dim=20, name='Embedding'))
model.add(MultiHead(keras.layers.LSTM(units=64), layer_num=3, name='Multi-LSTMs'))
model.add(keras.layers.Flatten(name='Flatten'))
model.add(keras.layers.Dense(units=4, activation='softmax', name='Dense'))
model.build()
model.summary()
- The other way is to use it as a stand-alone layer. This is a snippet of the second implementation, multi-head attention as a stand-alone layer, also taken from the keras-multi-head website:
import keras
from keras_multi_head import MultiHeadAttention
input_layer = keras.layers.Input(shape=(2, 3), name='Input')
att_layer = MultiHeadAttention(head_num=3, name='Multi-Head')(input_layer)
model = keras.models.Model(inputs=input_layer, outputs=att_layer)
model.compile(optimizer='adam', loss='mse', metrics={})
I have been trying to find documentation that explains this, but I have not found any yet.
Update:
What I have found is that the second implementation (MultiHeadAttention) is closer to the Transformer paper "Attention Is All You Need". The sketch below shows roughly what I think that stand-alone layer computes.
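To check my understanding of the second snippet, here is a rough NumPy sketch of multi-head self-attention as described in the paper. The weight matrices are random placeholders just to show the shapes, and this is only my mental model, not how keras-multi-head is actually implemented internally.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, head_num):
    # x: (batch, seq_len, feature_dim); feature_dim must be divisible by head_num
    batch, seq_len, dim = x.shape
    head_dim = dim // head_num
    rng = np.random.default_rng(0)
    # In a real layer these projections are learned; random here only to illustrate shapes.
    w_q, w_k, w_v, w_o = (rng.standard_normal((dim, dim)) for _ in range(4))
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Split the feature dimension into heads: (batch, head_num, seq_len, head_dim)
    split = lambda t: t.reshape(batch, seq_len, head_num, head_dim).transpose(0, 2, 1, 3)
    q, k, v = split(q), split(k), split(v)
    # Scaled dot-product attention, computed independently per head
    scores = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim))
    heads = scores @ v  # (batch, head_num, seq_len, head_dim)
    # Concatenate the heads and apply the output projection
    concat = heads.transpose(0, 2, 1, 3).reshape(batch, seq_len, dim)
    return concat @ w_o

out = multi_head_self_attention(np.random.rand(1, 2, 3), head_num=3)
print(out.shape)  # (1, 2, 3) -- same shape as the input, matching the snippet above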
However, I am still struggling to understand the first implementation, the wrapper layer. Does the first one (the wrapper layer) combine the output of multi-head attention with the LSTM?
I was wondering if someone could explain the idea behind both, especially the wrapper layer.