This may be a late answer for you, but for future reference, here is some starter code along the lines of the model you have in mind. Currently, there are three built-in attention layers in tf.keras (a minimal call sketch follows the list), namely
- MultiHeadAttention layer
- Attention layer (a.k.a. Luong-style attention)
- AdditiveAttention layer (a.k.a. Bahdanau-style attention)
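As a rough sketch of how these layers are called (the tensors and shapes here are purely illustrative assumptions): Attention and AdditiveAttention take a list [query, value] (optionally [query, value, key]), while MultiHeadAttention takes query and value as separate arguments.
import tensorflow as tf
from tensorflow.keras.layers import MultiHeadAttention, Attention, AdditiveAttention
query = tf.random.normal((2, 20, 8))   # [batch_size, Tq, dim]
value = tf.random.normal((2, 30, 8))   # [batch_size, Tv, dim]
# MultiHeadAttention: query and value are passed as separate arguments.
print(MultiHeadAttention(num_heads=2, key_dim=8)(query, value).shape)   # (2, 20, 8)
# Attention (Luong-style) and AdditiveAttention (Bahdanau-style): a list [query, value].
print(Attention()([query, value]).shape)            # (2, 20, 8)
print(AdditiveAttention()([query, value]).shape)    # (2, 20, 8)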
For the starter code, we'll use Luong-style attention in the encoder and the Bahdanau-style attention mechanism in the decoder. The overall autoencoder architecture would be:
a. encoder: input -> embedding -> gru -> luong-style-attn
b. decoder: input -> lstm -> bahdanau-style-attn -> gap -> classifier
# whole model
autoencoder: encoder + decoder
Let's build the model accordingly.
Encoder
import numpy as np  # used later for the dummy training data
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import *
from tensorflow.keras import backend
from tensorflow.keras import utils
backend.clear_session()
# int sequences.
enc_inputs = Input(shape=(20,), name='enc_inputs')
# Embedding lookup and GRU
embedding = Embedding(input_dim=100, output_dim=64)(enc_inputs)
whole_sequence = GRU(4, return_sequences=True)(embedding)
# Luong-style (self-)attention of shape [batch_size, Tq, units]; query = value = GRU output.
query_value_attention_seq = Attention()([whole_sequence, whole_sequence])
# build encoder model
encoder = Model(enc_inputs, query_value_attention_seq, name='encoder')
Check the layout:
utils.plot_model(encoder, show_shapes=True)
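As a quick sanity check, the encoder output should have 20 timesteps and 4 features (the GRU units), since the attention layer preserves the query shape:
print(encoder.output_shape)   # (None, 20, 4)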

Decoder
# Encoded sequences from the encoder: (timesteps, features).
dec_input = Input(shape=(20, 4), name='dec_inputs')
# LSTM
whole_sequence = LSTM(4, return_sequences=True)(dec_input)
# Bahdanau-style attention of shape [batch_size, Tq, units]; query = LSTM output, value = decoder input.
query_value_attention_seq = AdditiveAttention()([whole_sequence, dec_input])
# Reduce over the sequence axis to produce encodings of shape
# [batch_size, units].
query_value_attention = GlobalAveragePooling1D()(query_value_attention_seq)
# classification
dec_output = Dense(1, activation='sigmoid')(query_value_attention)
# build decoder model
decoder = Model(dec_input, dec_output, name='decoder')
Check the layout:
utils.plot_model(decoder, show_shapes=True)
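A quick shape check on a random encoded batch (this dummy input is only for illustration):
dummy_encoded = np.random.rand(2, 20, 4).astype('float32')
print(decoder(dummy_encoded).shape)   # (2, 1)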

Autoencoder
# encoder
encoder_init = Input(shape=(20, ))
encoder_output = encoder(encoder_init); print(encoder_output.shape)
# decoder
decoder_output = decoder(encoder_output); print(decoder_output.shape)
# bind all: autoencoder
autoencoder = Model(encoder_init, decoder_output)
# check layout
utils.plot_model(autoencoder, show_shapes=True, expand_nested=True)
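Optionally, print a summary; the nested encoder and decoder show up as single layers here:
autoencoder.summary()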

Dummy Training
x_train = np.random.randint(0, 10, (100,20)); print(x_train.shape)
y_train = np.random.randint(2, size=(100, 1)); print(y_train.shape)
(100, 20)
(100, 1)
autoencoder.compile('adam', 'binary_crossentropy')
autoencoder.fit(x_train, y_train, epochs=5, verbose=2)
Epoch 1/5
4/4 - 4s - loss: 0.6674
Epoch 2/5
4/4 - 0s - loss: 0.6637
Epoch 3/5
4/4 - 0s - loss: 0.6600
Epoch 4/5
4/4 - 0s - loss: 0.6590
Epoch 5/5
4/4 - 0s - loss: 0.6571
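After training, inference works like any other Keras model; a minimal, purely illustrative example on the same dummy data:
preds = autoencoder.predict(x_train[:5])
print(preds.shape)       # (5, 1)
# The sequence-level representations can also be taken from the encoder alone.
enc_feats = encoder.predict(x_train[:5])
print(enc_feats.shape)   # (5, 20, 4)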
Resources
Also, you can check out my other answer regarding the attention mechanism.
And this one is my favorite on the multi-head transformer; it's a three-part video series.