
I am new to attention mechanisms and I want to learn more about them by working through some practical examples. I came across a Keras implementation of multi-head attention on this website: Pypi keras multi-head. I found two different ways to implement it in Keras.

  1. One way is to use multi-head attention as a Keras wrapper layer around either an LSTM or a CNN. This is a snippet implementing multi-head as a wrapper layer around an LSTM in Keras. This example is taken from the website keras multi-head:
import keras
from keras_multi_head import MultiHead

model = keras.models.Sequential()
model.add(keras.layers.Embedding(input_dim=100, output_dim=20, name='Embedding'))
model.add(MultiHead(keras.layers.LSTM(units=64), layer_num=3, name='Multi-LSTMs'))
model.add(keras.layers.Flatten(name='Flatten'))
model.add(keras.layers.Dense(units=4, activation='softmax', name='Dense'))
model.build()
model.summary()
  2. The other way is to use it separately as a stand-alone layer. This is a snippet of the second implementation, multi-head as a stand-alone layer, also taken from keras multi-head:
import keras
from keras_multi_head import MultiHeadAttention

input_layer = keras.layers.Input(shape=(2, 3), name='Input')
att_layer = MultiHeadAttention(head_num=3, name='Multi-Head')(input_layer)
model = keras.models.Model(inputs=input_layer, outputs=att_layer)
model.compile(optimizer='adam', loss='mse', metrics={})

I have been trying to find documentation that explains this, but I have not found any yet.

Update:

What I have found is that the second implementation (MultiHeadAttention) is closer to the Transformer paper "Attention Is All You Need". However, I am still struggling to understand the first implementation, the wrapper layer.
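
For reference, this is my rough understanding of a single scaled dot-product attention head from that paper, written out in plain NumPy (no trainable projections, so it is only a sketch of the idea, not code from the keras-multi-head package):

import numpy as np

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, depth); my own illustration of the paper's formula
    d_k = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)           # (batch, seq_q, seq_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ v                                         # (batch, seq_q, depth_v)

q = k = v = np.random.rand(1, 2, 3)   # self-attention on a (2, 3) input, as in the snippet above
print(scaled_dot_product_attention(q, k, v).shape)             # (1, 2, 3)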

Does the first one (the wrapper layer) combine the output of multi-head attention with the LSTM?

I was wondering if someone could explain the idea behind them, especially the wrapper layer.


1 Answer


I understand your confusion. From my experience, what MultiHead (the wrapper) does is duplicate (or parallelize) the wrapped layer to form a kind of multichannel architecture, where each channel can be used to extract different features from the input.

For instance, each channel can have a different configuration, and the channels' outputs are later concatenated to make an inference. So MultiHead can be used to wrap conventional architectures to form a multi-head CNN, multi-head LSTM, etc. A hand-written sketch of this multichannel idea is shown below.
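
To make that concrete, here is a rough sketch of the same multichannel idea written by hand with the functional API. This is my own reconstruction of the concept, not the library's exact internals, and the concatenation axis may differ from what the wrapper actually produces:

import keras

inputs = keras.layers.Input(shape=(None,), name='Input')
embedded = keras.layers.Embedding(input_dim=100, output_dim=20, name='Embedding')(inputs)

# Three independent copies of the wrapped layer, each with its own weights,
# playing the role of three "heads" / channels over the same embedding.
branches = [keras.layers.LSTM(units=64)(embedded) for _ in range(3)]

merged = keras.layers.Concatenate(name='Concat')(branches)  # combine the channels
outputs = keras.layers.Dense(units=4, activation='softmax', name='Dense')(merged)

model = keras.models.Model(inputs=inputs, outputs=outputs)
model.summary()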

Note that the attention layer is different. You may stack attention layers to form a new architecture, or you may parallelize the attention layer (MultiHeadAttention) and configure each one as explained above; see the sketch after this paragraph. See here for different implementations of the attention layer.
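
For example, assuming the same keras_multi_head package as in the question, stacking the stand-alone attention layer could look roughly like this (a minimal sketch; the input shape and head counts are arbitrary):

import keras
from keras_multi_head import MultiHeadAttention

input_layer = keras.layers.Input(shape=(2, 3), name='Input')
att_1 = MultiHeadAttention(head_num=3, name='Attention-1')(input_layer)  # first self-attention block
att_2 = MultiHeadAttention(head_num=3, name='Attention-2')(att_1)        # stacked on top of the first
model = keras.models.Model(inputs=input_layer, outputs=att_2)
model.compile(optimizer='adam', loss='mse')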
