
I'm currently using this code that I got from one discussion on GitHub. Here's the code of the attention mechanism:

# Keras 1-style imports (the `merge` function with mode='mul' was removed in Keras 2)
from keras import backend as K
from keras.layers import (Input, Embedding, LSTM, Dense, Flatten, Activation,
                          RepeatVector, Permute, Lambda, merge)

# max_length, vocab_size, embedding_size and units are assumed to be defined elsewhere
_input = Input(shape=[max_length], dtype='int32')

# get the embedding layer
embedded = Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=False
    )(_input)

activations = LSTM(units, return_sequences=True)(embedded)

# compute importance for each step
attention = Dense(1, activation='tanh')(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)


sent_representation = merge([activations, attention], mode='mul')
sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)

probabilities = Dense(3, activation='softmax')(sent_representation)

Is this the correct way to do it? I was sort of expecting a TimeDistributed layer, since the attention mechanism is applied at every time step of the RNN. I need someone to confirm that this implementation (the code) is a correct implementation of the attention mechanism. Thank you.

Hooked
Aryo Pradipta Gema
  • here's a simple way to add attention: https://stackoverflow.com/questions/62948332/how-to-add-attention-layer-to-a-bi-lstm/62949137#62949137 – Marco Cerliani Jul 17 '20 at 14:57

5 Answers

If you want to have attention along the time dimension, then this part of your code seems correct to me:

activations = LSTM(units, return_sequences=True)(embedded)

# compute importance for each step
attention = Dense(1, activation='tanh')(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)

sent_representation = merge([activations, attention], mode='mul')

You've worked out the attention vector of shape (batch_size, max_length):

attention = Activation('softmax')(attention)

I've never seen this code before, so I can't say if this one is actually correct or not:

K.sum(xin, axis=-2)
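For what it's worth, here is a quick NumPy check of my own (not part of the original answer) of what that line computes: after the attention weights are broadcast to the shape of activations, summing over axis -2 collapses the time dimension, so the result is the attention-weighted sum of the LSTM hidden states.

    import numpy as np

    batch, max_length, units = 2, 4, 3                        # made-up sizes, just for the check
    activations = np.random.rand(batch, max_length, units)    # stands in for the LSTM outputs
    weights = np.random.rand(batch, max_length)
    weights /= weights.sum(axis=1, keepdims=True)             # like the softmax: each row sums to 1

    # what RepeatVector/Permute + element-wise multiply + K.sum(axis=-2) compute
    weighted = activations * weights[:, :, None]              # (batch, max_length, units)
    sent_representation = weighted.sum(axis=-2)               # (batch, units)

    # the same thing written as an explicit weighted sum over time steps
    expected = np.einsum('bt,btu->bu', weights, activations)
    assert np.allclose(sent_representation, expected)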

Further reading (you might have a look):

Philippe Remy
  • Hi there Philippe! Checking the original paper by Bahdanau et al., there seems to be some gap between your implementation and the proposed approach, namely the dimensions of the weight matrices and the score calculation. If this were an implementation of self-attention instead, as suggested by @felixhao28 in your repo, then a similar gap would exist yet again. It may be that I am unaware of this type of implementation; would you mind sharing which paper or study you based your implementation on? – Uzay Macar Jun 18 '19 at 01:20
  • There are many flavors of attention. The original paper by Bahdanau introduced attention for the first time and was complicated. There are simpler versions which do the job now. The OP's way of doing it is fine and needed only minor changes to make it work, as I have shown below. – Allohvk Mar 04 '21 at 15:55

Recently I was working on applying an attention mechanism to a dense layer, and here is one sample implementation:

from keras.layers import Input, Dense, multiply
from keras.models import Model
from keras import regularizers

def build_model():
  # train_data_X / train_data_Y_ are assumed to be defined elsewhere
  input_dims = train_data_X.shape[1]
  inputs = Input(shape=(input_dims,))
  dense1800 = Dense(1800, activation='relu', kernel_regularizer=regularizers.l2(0.01))(inputs)
  # one attention weight per feature, squashed into (0, 1) by the sigmoid
  attention_probs = Dense(1800, activation='sigmoid', name='attention_probs')(dense1800)
  # element-wise product: features scaled by their attention weights
  attention_mul = multiply([dense1800, attention_probs], name='attention_mul')
  dense7 = Dense(7, kernel_regularizer=regularizers.l2(0.01), activation='softmax')(attention_mul)
  model = Model(inputs=[inputs], outputs=dense7)
  model.compile(optimizer='adam',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
  return model

model = build_model()
model.summary()

model.fit(train_data_X, train_data_Y_, epochs=20, validation_split=0.2, batch_size=600, shuffle=True, verbose=1)


Abhijay Ghildyal
  • Hey, thank you for this toy example of attention, but can I give some recommendations? Write the full model as it is and not as `baseline_model()`, like how AryoPradiptaGema wrote the model. Can you please explain the model with the math (https://i.stack.imgur.com/pyqhn.gif)? For example, `attention_probs` is likely `ut` in the equation, and similarly `attention_mul` == `v^T*ut`, etc. I am also learning the attention model, and yours is the most intuitive and simple explanation of attention. – Eka Jul 12 '19 at 02:15

The attention mechanism pays attention to different parts of the sentence:

activations = LSTM(units, return_sequences=True)(embedded)

And it determines the contribution of each hidden state of that sentence by

  1. Computing an aggregation score for each hidden state: `attention = Dense(1, activation='tanh')(activations)`
  2. Assigning a weight to each state: `attention = Activation('softmax')(attention)` (a small sketch of these two steps follows this list)
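As a rough illustration of these two steps (a sketch of my own with made-up sizes, not from the answer), the Dense(1) layer produces one scalar score per time step and the softmax turns those scores into weights that sum to 1 across the sentence:

    import numpy as np
    from tensorflow.keras.layers import Input, Dense, Flatten, Activation
    from tensorflow.keras.models import Model

    max_length, units = 5, 8                               # made-up sizes
    activations_in = Input(shape=(max_length, units))      # stands in for the LSTM output

    scores = Dense(1, activation='tanh')(activations_in)   # one score per time step: (batch, max_length, 1)
    scores = Flatten()(scores)                              # (batch, max_length)
    weights = Activation('softmax')(scores)                 # attention weights over the time steps

    m = Model(activations_in, weights)
    w = m.predict(np.random.rand(1, max_length, units))
    print(w.shape, w.sum())                                 # (1, 5), and the weights sum to ~1.0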

And finally, it attends to the different states:

sent_representation = merge([activations, attention], mode='mul')

I don't quite understand this part: sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)

To understand more, you can refer to this and this; this one also gives a good implementation, so see if you can understand more on your own.

MJeremy

While many good alternatives are given, I have tried to modify the code YOU have shared to make it work. I have also answered your other query that has not been addressed so far:

Q1. Is this the correct way to do it? The attention layer itself looks good. No changes needed. The way you have used the output of the attention layer can be slightly simplified and modified to incorporate some recent framework upgrades.

    sent_representation = merge.Multiply()([activations, attention])
    sent_representation = Lambda(lambda xin: K.sum(xin, axis=1))(sent_representation)

You are now good to go!
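For reference, here is a minimal end-to-end sketch of my own with these two replacement lines plugged into the question's model, written against tf.keras (the sizes are made up purely for illustration, and I use the Multiply layer directly instead of merge.Multiply):

    from tensorflow.keras import backend as K
    from tensorflow.keras.layers import (Input, Embedding, LSTM, Dense, Flatten,
                                         Activation, RepeatVector, Permute,
                                         Multiply, Lambda)
    from tensorflow.keras.models import Model

    vocab_size, embedding_size, max_length, units = 10000, 100, 50, 64   # hypothetical sizes

    _input = Input(shape=[max_length], dtype='int32')
    embedded = Embedding(vocab_size, embedding_size, trainable=False)(_input)
    activations = LSTM(units, return_sequences=True)(embedded)     # (batch, max_length, units)

    attention = Dense(1, activation='tanh')(activations)           # (batch, max_length, 1)
    attention = Flatten()(attention)                                # (batch, max_length)
    attention = Activation('softmax')(attention)                    # weights sum to 1 over time
    attention = RepeatVector(units)(attention)                      # (batch, units, max_length)
    attention = Permute([2, 1])(attention)                          # (batch, max_length, units)

    sent_representation = Multiply()([activations, attention])
    sent_representation = Lambda(lambda xin: K.sum(xin, axis=1))(sent_representation)  # (batch, units)

    probabilities = Dense(3, activation='softmax')(sent_representation)
    model = Model(_input, probabilities)
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.summary()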

Q2. I was sort of expecting a TimeDistributed layer, since the attention mechanism is applied at every time step of the RNN

No, you don't need a TimeDistributed layer; otherwise the weights would be shared across timesteps, which is not what you want.

You can refer to: https://towardsdatascience.com/create-your-own-custom-attention-layer-understand-all-flavours-2201b5e8be9e for other specific details

Allohvk

I think you can try the following code to add a Keras self-attention mechanism to an LSTM network:

    from keras.models import Model
    from keras.layers import Input, Embedding, LSTM, Flatten, Dense
    from keras.optimizers import Adam
    from keras_self_attention import SeqSelfAttention

    # vocab_size, EMBEDDING_DIM, MAX_SEQUENCE_LENGTH, embedding_matrix, num_lstm,
    # X_train, y_train, X_val and y_val are assumed to be defined elsewhere
    inputs = Input(shape=(MAX_SEQUENCE_LENGTH,))
    embedding = Embedding(vocab_size, EMBEDDING_DIM, weights=[embedding_matrix],
                          input_length=MAX_SEQUENCE_LENGTH, trainable=False)(inputs)
    lstm = LSTM(num_lstm, return_sequences=True)(embedding)   # input_shape is not needed in the functional API
    attn = SeqSelfAttention(attention_activation='sigmoid')(lstm)
    flat = Flatten()(attn)
    dense = Dense(32, activation='relu')(flat)
    outputs = Dense(3, activation='sigmoid')(dense)
    model = Model(inputs=[inputs], outputs=outputs)
    model.compile(loss='binary_crossentropy', optimizer=Adam(0.001), metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_val, y_val), shuffle=True)
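Note that `SeqSelfAttention` is not part of Keras itself; it comes from the third-party keras-self-attention package, which (as far as I know) is installed with `pip install keras-self-attention`.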
keramat
Ashok Kumar Jayaraman